Maybe just a matter of familiarity?
They're both extremely useful when, say, building a parser; but when the kind of code you write involves displaying custom widgets, pushing data onto a queue, looking up data in a database, and so on, integration tests have a lot more bang for the buck.
It has some use if you build something like a complex pricing engine, numerical code or a parser for a mini DSL. I find that problems of this type don't crop up a lot though.
I've used it for things like "regardless of where you are on the page, tabbing n times and then shift-tabbing n times leaves you on the original item".
I found a bug in our TV UI library which was actually a bug in the spec. Regardless of how you built the UI, if you pressed a direction and focus moved, pressing the opposite direction should take you back - but we had another rule that broke this. We had tests for both rules, and it was only when I wrote the general test (over all UIs, the items in them, and all directions) that it found the inconsistency.
It was also pretty easy to write.
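For flavour, here's roughly what that kind of test can look like as a minimal Python/Hypothesis sketch, with a toy clamping grid standing in for a real focus-management API (everything here is illustrative, not our actual code):

    from hypothesis import given, strategies as st

    MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}
    OPPOSITE = {"left": "right", "right": "left", "up": "down", "down": "up"}

    def move(focus, direction, rows=10, cols=10):
        # Toy focus model: move one cell, clamping at the grid edges.
        r, c = focus
        dr, dc = MOVES[direction]
        return (min(max(r + dr, 0), rows - 1), min(max(c + dc, 0), cols - 1))

    @given(
        focus=st.tuples(st.integers(0, 9), st.integers(0, 9)),
        direction=st.sampled_from(sorted(MOVES)),
    )
    def test_opposite_direction_restores_focus(focus, direction):
        after = move(focus, direction)
        if after != focus:  # the rule only applies when focus actually moved
            assert move(after, OPPOSITE[direction]) == focus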
I've also found issues around text processing (lowercasing doesn't always produce a string of the same length), and more. I found a bug while demoing PBT for a contract gig I was going for, around some versioning logic.
To be honest, I've never implemented it, even for a demo, without finding a bug.
Me too. I tend to roll my own property testers / fuzzers per project instead of using a library, but my experience is similar to yours. Out of maybe 25 testers, I think the only times I didn't find any bugs were when I messed up the tests themselves.
It’s incredibly humbling work.
> It’s incredibly humbling work
Absolutely. Glad others feel it too.
I almost feel like you need a decent reason not to use them in most places.
Ideally you can describe the behaviour of something in general terms: what stays true. If you can't, is the behaviour too hard for people to understand?
I will say, however, that I don't usually write PBTs myself - but I have a strong feeling I should add more.
Hypothesis can use a fuzzer as its input source, but it's marked as experimental.
It doesn't seem like "proper" fuzzers are used very often with Haskell (e.g. there's an example at https://www.tweag.io/blog/2023-06-15-ghc-libfuzzer/ ), though many people seem to have hooked up the HPC code coverage tool to property checkers, e.g.
https://www.cse.iitk.ac.in/users/karkare/MTP/2011-12/subhash...
One simple example (from the above) is "sin (π − x) = sin x" for an implementation of the sine function, which doesn't require knowing its specific output values. Naturally, instead of the literal equality "=" one can use a more appropriate accuracy specification, say a relative ulp (https://en.wikipedia.org/wiki/Unit_in_the_last_place) error bound, cf. https://members.loria.fr/PZimmermann/papers/accuracy.pdf
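A minimal sketch of that property in Python with Hypothesis, using math.sin and a loose mixed tolerance as a crude stand-in for a proper ulp-based bound:

    import math
    from hypothesis import given, strategies as st

    @given(st.floats(0.0, math.pi))
    def test_sine_reflection(x):
        # sin(pi - x) == sin(x), up to small absolute/relative error
        # (math.pi is itself only an approximation of pi)
        lhs, rhs = math.sin(math.pi - x), math.sin(x)
        assert math.isclose(lhs, rhs, rel_tol=1e-12, abs_tol=1e-15)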
> An individual company having specific needs isn't at odds with a general statement being broadly true.
Google needs certain things more than reduced carbon emissions, and Jane Street needs certain things more than whatever else they could spend that dev time on.
Why is PBT useful at Jane Street, at least more than in other places? Is it the use of a functional language? The average Jane Street dev being more familiar with PBT? Is the domain particularly suited to this style of testing?
Explicitly, my claim is that the biggest bottleneck is education on how to use PBT effectively. Jane Street isn't using it to get an extra mile of safety; they use it because it's the easiest way to write a large chunk of their tests.
Because trading firms write a lot more algorithmic code than most businesses. Trading strategy code is intensely algorithmic and calculation-heavy by its very nature, as is a lot of the support code written around it.
At least, that's what it was like when I worked in a trading firm. Relatedly, it was one of the few projects I'd worked on where having 95% unit tests and 5% integration tests made perfect sense. It fit the nature of the code, which wasn't typical of most businesses.
Somebody else wrote that they write a lot of numerical code in another business for which property testing is extremely useful, and I don't doubt that either. 95% is still != 100% though.
Barring that though, capnproto and flatbuffers (perhaps with compression on slow networks) are usually faster than protos. Other people have observed that performance deficit on many occasions and made smaller moderately general-purpose libraries before too (like SBE). They all have their own flavors of warts, but they're all often much faster for normal use cases than protos.
As a hybrid, each project defining its own (de)serializer library can work well too. I've done that a few times, and it's pretty easy to squeeze out 10x-20x throughput for the serialization features your project actually needs while still only writing the serialization crap once and reusing it for all your data types.
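As a toy illustration of that idea (Python, with a hypothetical fixed-layout order record; a real project would hand-roll or generate this per type): once both sides agree on the layout, encoding and decoding collapse to a single struct call, with no tags, no varints, and no field-order branching.

    import struct

    # Agreed fixed layout: id as int64, price as float64, qty as int32.
    ORDER = struct.Struct("<qdi")

    def encode(order_id: int, price: float, qty: int) -> bytes:
        return ORDER.pack(order_id, price, qty)

    def decode(buf: bytes) -> tuple:
        return ORDER.unpack_from(buf)

    assert decode(encode(42, 19.99, 3)) == (42, 19.99, 3)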
Recapping a few reasons why protos are slow:
- There's a data dependency built into the wire format which is very hard to work around. It blocks nearly all attempts at CPU pipelining and vectorization.
- Lengths are prefixed (and the data is variable-length), requiring you to (recursively) serialize a submessage before serializing its header -- forcing either copies or undersized syscalls.
- Fields are allowed to appear in any order, preventing any sort of code which might make the branch predictor happy.
- Some non-"zero-copy" protocols are still quite fast since you can get away with a single allocation. Since several design decisions make walking the structure slow, that's way more expensive than it should be for protos, requiring either multiple (slow) walks or recursive allocations.
- The complexity of the format opens protos up to user error. Nonsense like using a 10-byte, slow-to-decode varint for the constant -1 instead of either 1, 4, or 8 fast-to-decode bytes (which _are_ supported by the wire format, but in the wild I see a lot of poorly suited proto specs); see the sketch below.
- The premise in the protocol that you'll decode the entire type exactly as the proto defines prevents a lot of downstream optimizations. If you want a shared data language (the `.proto` file), you have to modify that language to enforce, e.g., non-nullability constraints (you'd prefer to quickly short-circuit those as parse errors, but instead you need extra runtime logic to parse the parsed proto). You start having to trade off reusability for performance.
And so on. It's an elegant format that solves some real problems, but there are precious few cases where it's a top contender for performance (those cases tend to look like bulk data in some primitive type protos handle well, as opposed to arbitrary nesting of 1000 unrelated fields).
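To make the varint point concrete, here's a quick Python sketch of protobuf's base-128 varint encoding: -1 stored in a plain int64 field costs 10 bytes on the wire, while the zigzag-encoded sint64 form costs 1.

    def encode_varint(n: int) -> bytes:
        # Protobuf base-128 varint: 7 bits per byte, MSB = continuation.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def zigzag64(n: int) -> int:
        # Maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... for sint fields.
        return (n << 1) ^ (n >> 63)

    assert len(encode_varint((-1) & 0xFFFFFFFFFFFFFFFF)) == 10  # int64 field
    assert len(encode_varint(zigzag64(-1))) == 1                # sint64 field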
Specific languages might have (of course) failed to optimize other options so much that protos still win. It sounds like you're using golang, which I've not done much with (coming from other languages, I'm mildly surprised that messagepack didn't win any of your measurements), and by all means you should choose tools based on the data you have. My complaints are all about what the CPU is capable of for a given protocol, and how optimization looks from a systems language perspective.
One way to look at the problem is to go build a sufficiently smart generic protocol and write down everything that's challenging to support in v1. You have tradeoffs between size (slow for slow networks), data dependencies (slow for modern CPUs), lane segmentation (parallel processing vs cache-friendly single-core access vs code complexity), forward/backward compatibility, how much validation the protocol should do, .... Any specific data serialization problem usually has some outside knowledge you can use to remove or simplify a few of those "requirements," and knowledge of the surrounding system can further guide you to efficient data representations on _both_ sides of the transfer. Code that's less general-purpose tends to have more opportunities for being small, fast, and explainable.
A common source of inefficiencies (protobuf is not unique in this) is the use of a schema language in any capacity as a blunt weapon to bludgeon the m x n problem between producers and consumers. The coding pattern of generating generic producers/consumers doesn't allow for fine-tuning of any producer/consumer pair.
Picking on flatbuffers as an example (I _like_ the project, but I'll ignore that sentiment for the moment), the vtable approach is smart and flexible, but it's poorly suited (compared to a full "parse" step) to data you intend to access frequently, especially when doing narrow operations. It's an overhead (one that reduces the ability for the CPU to pipeline your operations) you incur precisely by trying to define a generic format which many people can produce and consume, especially when the tech that produces that generic format is itself generic (operating on any valid schema file). Fully generic code is hard enough to make correct, much less fast, so in the aim of correctness and maintainability you usually compromise on speed somewhere.
For that (slightly vague) flatbuffers example, the "purpose-built protocol" could be as simple as almost anything else with a proper parse step. That might even be cap'n proto, though that also has problems in certain kinds of nested/repeated structures because of its arena allocation strategy (better than protobuf, but still more allocations and wasted space than you'd like).
Edit: also a numerical domain, which is the easiest type to use them for in my experience!
I can assure you they aren't.
Wasn't the case in the trading firm I worked at.
Do you work at Jane Street? Have you worked elsewhere?
Yes, it's a lot of work coming up with good properties, but it massively helps to find gaps in the domain logic. In my experience, these gaps are what's typically expensive, not the weird problem a junior had with properly using Redis or S3.
I'm not an amateur.
The only time I don't do this is when there is literally no domain logic yet (e.g. a CRUD app).
>In my experience, these gaps are what's typically expensive, not the weird problem a junior had with properly using Redis or S3.
What can I say? Your experience might not be as broad as mine.
Redis is a source of almost no bugs because it is very well designed, but most interfaces I couple to have design qualities that are the exact opposite of Redis's.
Those interfaces (e.g. wonky payment gateway APIs, weird microservice APIs) are probably the source of most bugs in the enterprise systems I work on.
#2 is probably simple misspecifications (the customer said the code should do X when it should actually do Y, which is almost the same but very slightly different).
#3 would be domain logic errors. And even most of those are uncovered and avoided with saner architecture or a couple of unit tests.
For the parsers I write at home, sure, property testing kicks ass. For your college degree algorithm coursework, sure, it helps a lot. For 95% of business logic? Pointless - the complexity isn't buried deep in the business logic.
I don't do much mocking either, because I also prefer integration tests for these types of applications. Here is an example that tests against a database, all the way up to the state management layer of the UI: https://concerningquality.com/model-based-testing/.
In fairness, people don't write about doing this a lot. PBT literature seems to revolve around sorting functions or binary tree operations. But PBT is much more powerful than just that.
Users care quite a lot when these things break.
>They care about business functionality, which is quite easily expressible in model form.
They care about reliable applications that do the task they want. Users don't have a concept of "business functionality". If a widget they wanted to use is broken, they'll be pissed - they won't tell you that it's ok because the concept of the application was still good.
From your link:
>The main thing that I want to highlight here is how simple the model is. This is pure domain logic, and is probably as close to the essential complexity of the application that we can get
This is exactly the issue.
If I go down my recent list of bugs and categorize them, ~8 out of 10 of them are in areas which the author would have *abstracted away*. The other two probably could have been avoided by having a team member write one test to cover that functionality instead of zero.
What does it mean when you abstract something away from a test? Well, it means the test won't catch a bug in that thing.
If I went down my recent list of bugs and saw that 9 out of 10 of them would have been caught by adding property testing/model testing on top of regular testing, typing, etc., I'd have absolutely no second thoughts about implementing it - but that simply isn't the case.
I have worked on problems in (unusual) environments where it is the case, and in those rare cases I have advocated for and used property testing. But I have more often seen it advocated and used by people who see every bug as a nail and property testing as some kind of super-advanced hammer.
Users care when their expected behavior breaks. They certainly do not care why it broke, or _where_ it broke. Most users don't know what a database is.
> If I go down my recent list of bugs and categorize them, ~8 out of 10 of them are in areas which the author would have abstracted away.
Even though the model is abstract, all of the behavior of the implementation is covered in a model-based test. For example, let's say a SQL query is what's broken for some use case. This query would be exercised in the test, and it wouldn't pass until the query is fixed. That's the main reason I prefer model-based integration tests - there are no such bugs that escape.
Also, I still write targeted tests. There's nothing that prevents you from doing that, and it absolutely is useful in many cases. I just believe that most tests should be generated, and hand-written tests should be the minority.
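To give a feel for the shape of such a test, here's a minimal sketch using Hypothesis's stateful testing; `SqlUserStore` is a hypothetical stand-in for a real, database-backed implementation, and the model is just a dict:

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, rule

    class StoreMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.real = SqlUserStore()  # hypothetical DB-backed store
            self.model = {}             # the model: pure domain logic

        @rule(key=st.text(), value=st.integers())
        def put(self, key, value):
            self.real.put(key, value)
            self.model[key] = value

        @rule(key=st.text())
        def get(self, key):
            # A broken SQL query under get() fails here,
            # even though the model itself is abstract.
            assert self.real.get(key) == self.model.get(key)

    TestStore = StoreMachine.TestCase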
https://www.cs.dartmouth.edu/~doug/nfa.pdf gives an NFA variant; extending it to a PDA is an (interesting, I found) exercise.
In principle, the same framework would extend to parsers too (atto/mega/parsec, etc.), though parsing is often a bottleneck in real applications, so many would avoid such abstraction/indirection. Still, it would be nice if libraries could define parsers for their datatypes, in such a way that it's easily reusable as a generator in tests (without having the library package depending on any test-generator packages).
They also demonstrate finding the "derivative" of a generator (i.e. the result of fixing its next choice to a particular outcome), which is useful for avoiding choices that cause preconditions to fail (something which many PBT users seem to struggle with: assuming that the framework will somehow avoid problematic values, rather than just rejection-sampling them as most do).
It reads JSON data (described in the paper), which can be generated during property checking, e.g. using https://github.com/tyche-pbt/tyche-quickcheck or a Hypothesis equivalent.
I have also used Haskell's `QuickCheck` and Clojure's `spec` / `test.check` and have had a great experience with these. In my experience they "just work".
Conversely, if you're trying to generate non-trivial datasets, you will likely run into situations where your specification is correct but Hypothesis' implementation fails to generate data, or takes an unreasonable amount of time to generate data.
Example: Generate a 100x25 array of numeric values, where the only condition is that they must not all be zero simultaneously. [1]
[1] https://github.com/HypothesisWorks/hypothesis/issues/3493
A silly idea for your generator would be to generate an array, and if it's all zeros, draw a random index and a random non-zero number and write it into the array. This leads to some weird non-convexity properties but is a workable hack.
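Something like this sketch (assuming NumPy arrays rather than dataframes, for brevity):

    import numpy as np
    from hypothesis import strategies as st
    from hypothesis.extra.numpy import arrays

    @st.composite
    def nonzero_arrays(draw, shape=(100, 25)):
        a = draw(arrays(np.float64, shape, elements=st.floats(0.0, 1e6)))
        if not a.any():
            # All-zero draw: patch one random cell instead of rejecting the draw.
            i = draw(st.integers(0, shape[0] - 1))
            j = draw(st.integers(0, shape[1] - 1))
            a[i, j] = draw(st.floats(1e-6, 1e6))
        return a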
In your own example you turned off the "data too slow" issue, probably because building up a dataframe (all just to do a column sum!) is actually kind of costly at large sizes! Your complaint is probably actually about the pandas extras (or pandas itself) rather than the concept of Hypothesis.
But! Even though it doesn't get that much slower, at a certain number of rows it just starts hanging! At 49 rows everything is still fine, and at 50 it no longer wants to work. It's very bizarre and I'll see if I can debug it. But I think your test case points to some sort of bug rather than a fundamental issue with Hypothesis.
Let me know if you have a reproducer, I'd be curious to take a look.
Yes, this brings back memories. I've definitely seen this kind of behaviour as well, in many different, not-particularly-exotic, situations.
I am absolutely convinced the issue I raised on the github project was a bug or a defect, despite the maintainers not taking it seriously.
I find QuickCheck and Clojure spec/test.check much more straightforward to use. I just never ran into this sort of thing with these other tools.
(a) Filtering is a last resort and is best avoided. As an example, the Gen type in Haskell's falsify package can't be filtered, since it's a bad idea. As another example, ScalaCheck's Gen type can be filtered, but they also allow "retries" (by default, up to 10,000 times), because filtering is very wasteful.
(b) If you're going to filter, scope it to be as small as possible (e.g. one comment points out that you're discarding and regenerating entire dataframes, when the filter only depends on one particular column)
(c) Have some vague awareness of how your generators will shrink, to avoid infinite loops. In your case, shrinking will make it more likely to fail your filter; and the "smallest" dataframe (all zeros) will definitely fail.
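A tiny example of (b): filter the element generator, not the whole structure, so discards stay cheap and shrinking stays well-behaved. (A sketch; the round-trip property is just for illustration.)

    from hypothesis import given, strategies as st

    # Scoped filter: reject individual floats, never whole lists.
    nonzero = st.floats(-1e6, 1e6).filter(lambda x: abs(x) > 1e-6)

    @given(st.lists(nonzero, min_size=1, max_size=50))
    def test_reciprocal_roundtrip(xs):
        for x in xs:
            assert abs(1.0 / (1.0 / x) - x) <= abs(x) * 1e-9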
My initial impulse is to pick a random cell which must not be zero, generate a random number for each other cell, and a random non-zero number for that one. I'm not immediately decided on whether it's uniformly distributed.
Any algorithm that cares about the number of non-zeros could have non-trivial interactions with their arrangement and count, so picking something that generates non-trivial sparsity (and doesn't just make the array look like white noise) is going to have the best chance of exposing interesting behavior. The tricky part is thinking through how to generate "interesting" patterns, which admittedly I haven't put enough thought into.
Right! And even if it were, in the sense that that's what we should expect as real world input, it wouldn't generally be the best distribution for finding bugs.
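One way to bias toward interesting sparsity (a sketch, reusing the NumPy/Hypothesis imports from the earlier sketch): draw the number of non-zeros first, then their positions, so the sparsity level itself is explored and shrunk directly.

    @st.composite
    def sparse_arrays(draw, shape=(100, 25)):
        n = shape[0] * shape[1]
        # Draw the non-zero count first (capped here to keep generation fast).
        k = draw(st.integers(min_value=1, max_value=min(n, 200)))
        cells = draw(st.lists(st.integers(0, n - 1),
                              min_size=k, max_size=k, unique=True))
        a = np.zeros(n)
        for c in cells:
            a[c] = draw(st.floats(1e-6, 1e6))
        return a.reshape(shape)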
Haskell's "falsify" package takes a similar approach, but uses a tree of random values. This has the advantage that composite generators can run each of their parts against a different sub-tree, and hence they can be shrunk independently without interfering.
ISTM that the benefits and drawbacks are just two sides of the same coin. The benefit is reduced testing effort. The drawback is that the savings come from sacrificing knowledge of exactly what has been tested.
> I also observed that PBT users are not always good at evaluating whether their testing was effective.
Ah, there's the rub. It seems that the benefit of making the tested property explicit is partially offset by not really knowing how much of the search space was explored automatically.
I wouldn't agree with this assessment. PBT has frequently surprised me, finding bugs that I never would have thought to test. It's not merely a tool for generating the unit tests we would have written anyway; it's a way to write down how we think the parts of a system interact, and have the machine show us that we're wrong. Crucially, it's quite hard to prove ourselves wrong, since it's hard to think up scenarios that we hadn't considered, so it's nice to have the machine help with that part.
> The drawback is that savings comes from sacrificing knowledge of exactly what has been tested.
I encourage you to use the statistics-gathering features that many (most?) PBT systems provide (e.g. functions with names like "collect", "label", etc.). Also, whilst preconditions/filtering can be useful in a pinch, it's usually a good idea for a property to construct data of the appropriate sort itself (e.g. using arbitrary arguments to fill in an appropriate template); rather than relying on a completely arbitrary generator to hit scenarios you care about.
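In Hypothesis, for example, event() buckets show up in the test statistics (run pytest with --hypothesis-show-statistics), which gives a rough picture of what the generator actually exercised:

    from hypothesis import event, given, strategies as st

    @given(st.lists(st.integers()))
    def test_sort_is_idempotent(xs):
        # Label each run so the statistics reveal the input distribution.
        event(f"list length bucket: {len(xs) // 10}")
        assert sorted(sorted(xs)) == sorted(xs)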
I do think that immutable-by-default OCaml plus good PBT tooling there helps a lot.
(+ x y)
should compute the same thing as
(the <type1> (+ (the <type2> x) (the <type3> y)))
where the types are randomly generated types that contain the values in question.
Hypothesis has a nice feature which uses a hash of each test's source code as the PRNG seed for that test, which ensures the same examples are checked every time (unless someone changes the text of that particular test). It's a nice idea, but might be hard to implement in compiled languages (like Haskell, for example).