`falsify`: Hypothesis-inspired shrinking for Haskell

Edsko de Vries – Tuesday, 18 April 2023

Consider this falsify property test that tries to verify the (obviously false) property that all elements of all lists of up to 10 binary digits are the same (we will explain the details below; hopefully the intent is clear):

prop_list :: Property ()
prop_list = do
    n  <- gen $ Gen.integral $ Range.between (0, 10)
    xs <- gen $ replicateM n $ Gen.int $ Range.between (0, 1)
    assert $ P.pairwise P.eq .$ ("xs", xs)

we might get a counter-example such as this:

failed after 9 shrinks
(xs !! 0) /= (xs !! 1)
xs     : [0,1]
xs !! 0: 0
xs !! 1: 1

More interesting than the counter-example itself is how falsify arrived at that counter-example; if we look at the shrink history (--falsify-verbose), we see that the list shrunk as follows:

   [1,1,0,1,0,1]
~> [1,1,0]       -- shrink the list length
~> [0,1,0]       -- shrink an element of the list
~> [0,1]         -- shrink the list length again

The test runner is able to go back and forth between shrinking the length and the list, and shrinking elements in the list. That is, we have integrated shrinking (like in hedgehog: we do not specify a separate generator and shrinker), which is internal: works across monadic bind. The Python Hypothesis library showed the world how to achieve this. In this blog post we will introduce falsify, a new library that provides property based testing in Haskell and has an approach to shrinking that is inspired by Hypothesis. As we shall see, however, the details are quite different.

Background

In this first section we will discuss some of the background behind falsify; the next section will be a more tutorial-style introduction on how to use it. This section is not meant to an exhaustive discussion of the theory behind falsify, or how the theory differs from that of Hypothesis; both of those topics will be covered in a paper, currently under review. However, a basic understanding of these principles will help to use the library more effectively, and so that will be our goal in this first section.

Unit testing versus property based testing

In unit testing (for example using tasty-hunit), a test for a function f might look something like this:

test :: Assertion
test =
    unless (f input == expected) $
      assertFailure "not equal"

That is, we apply f to specific input, and then verify that we get an expected result. By contrast, in property based testing, we do not specify a specific input, but instead generate a random input using some generator genInput, and then verify that the input and the output are related by some property prop:

test_property :: Property ()
test_property = do
    input <- gen $ genInput
    unless (prop input (f input)) $
      testFailed "property not satisfied"

This blog post is not intended as an introduction to property-based testing; merely observe that generation of input values is a critical ingredient in property based testing. However, if you are not familiar with the topic, or not yet convinced that you should be using it, I can highly recommend watching Testing the Hard Stuff and Staying Sane, and then reading How to Specify It!: A Guide to Writing Properties of Pure Functions, both by the world’s foremost property-based testing guru John Hughes.

The importance of shrinking

Suppose we want to test the (false) property that for all numbers x and y, x - y == y - x:

prop_shrinking :: Property ()
prop_shrinking = do
    x <- gen $ Gen.int $ Range.between (0, 99)
    y <- gen $ Gen.int $ Range.between (0, 99)
    unless (x - y == y - x) $
      testFailed "property not satisfied"

Since the property is false, we will get a counter-example. Without shrinking, one such a counter-example might be

x = 38
y = 23

However, although that is indeed a counter-example to the property, it’s not a great counter-example. Why these specific numbers? Is there something special about them? To quote John Hughes in Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane:

Random tests contain a great deal of junk—that is their purpose! Junk provokes unexpected behaviour and tests scenarios that the developer would never think of. But tests usually fail because of just a few features of the test case. Debugging a test that is 90% irrelevant is a nightmare; presenting the developer with a test where every part is known to be relevant to the failure, simplifies the debugging task enormously.

(emphasis mine). For our example, the numbers 38 and 23 are not particularly relevant to the failure; with shrinking, however, the counter-example we will get is

x = 0
y = 1

Indeed, this is the only counter-example we will ever get: 0 is the smallest number possible (“least amount of detail”), and the only thing that is relevant about the second number is that it’s not equal to the first.

Parsing versus generation

Generation of inputs relies on pseudo-random number generators (PRNGs). The typical interface to a PRNGs is something like this:

nextSample :: PRNG -> (Word, PRNG)

Given such an interface, we might define the type of generators as

newtype Gen a = Gen (PRNG -> (a, PRNG))

This covers generation, but not shrinking. The traditional approach in QuickCheck to shrinking is to pair a generator with a shrinking function, a function of type

shrink :: a -> [a]

This works, but it’s not without its problems; see my blog post Integrated versus Manual Shrinking for an in-depth discussion. The key insight of the Hypothesis library is that instead of shrinking generated values, we instead shrink the samples produced by the PRNG. Suppose we unfold a PRNG to a stream of random samples:

unfoldLinear :: PRNG -> [Word]
unfoldLinear prng =
    let (s, prng') = next prng
    in s : unfoldLinear prng'

Then we can shift our perspective: rather than thinking of generating random values from a PRNG we instead parse this stream of random samples:

newtype Parser a = Parser ([Word] -> (a, [Word])

Instead of having a separate shrinking function, we now simply shrink the list of samples, and then re-run the parser. This is the Hypothesis approach in a nutshell; parsers of course need to ensure that the produced value shrinks as the samples are shrunk. For example, here is a very simple (proof of concept) generator for Bool:

parseBool :: Parser Bool
parseBool = Parser $ \(s:ss) -> (
    if s >= maxBound `div` 2 then True else False
  , ss
  )

Assuming that the sample is chosen uniformly in the full Word range, this parser will choose uniformly between True and False; and as the sample is shrunk towards zero, the boolean will shrink towards False.

Streams versus trees

If you look at the definition of Gen in QuickCheck, you will see it’s actually different to the definition we showed above:

newtype Gen a = Gen (PRNG -> a)

Like our definition above, this generator takes a PRNG as input, but it does not return an updated PRNG. This might seem confusing: suppose we are generating two numbers, as in our example above; how do we ensure those two numbers are generated from different PRNGs?

To solve this problem, we will need a PRNG that in addition to next, also provides a way to split the PRNG into two new PRNGs:

next  :: PRNG -> (Word, PRNG)  -- as before
split :: PRNG -> (PRNG, PRNG)  -- new

Then to run two generators, we first split the PRNG:

both :: QcGen a -> QcGen b -> QcGen (a, b)
both (QcGen g1) (QcGen g2) = QcGen $ \prng ->
    let (l, r) = split prng
    in (g1 l, g2 r)

The advantage of this approach is laziness: we can produce the second value of type b without generating the value of type a first. Indeed, if we never demand the value of a, we will not generate it at all! This is of critical importance if we have generators for infinite values; for example, it is what enables us to Generate Functions.

The `falsify` definition of `Gen`

If we apply the insight from Hypothesis (that is, parse samples rather than generate using PRNGs) to this new setting where splitting PRNGs is a fundamental operation, we arrive at the definition of Gen in falsify. First, unfolding a PRNG does not give us an infinite stream of samples, but rather an infinite tree of samples:

data STree = STree Word STree STree

unfold :: PRNG -> STree
unfold prng =
    let (s, _) = next  prng
        (l, r) = split prng
    in STree s (unfold l) (unfold r)

A generator is then a function that takes a part of a sample tree, parses it, and produces a value and an updated sample tree:

newtype Gen a = Gen (STree -> (a, [STree]))

This does not reintroduce dependencies between generators: each generator will be run against a different subtree, and update only that subtree. For example, here is how we might run two generators:

both :: Gen a -> Gen b -> Gen (a, b)
both (Gen g1) (Gen g2) = Gen $ \(STree s l r) ->
    let (a, ls) = g1 l
        (b, rs) = g2 r
    in ( (a, b)
       ,    [STree s l' r  | l' <- ls]
         ++ [STree s l  r' | r' <- rs]
       )

Note that we are focussing on the core concepts here, and are glossing over various details. In particular, the actual definition in falsify has an additional constructor Minimal, which is a finite representation of the infinite tree that is zero everywhere. This is a key component in making this work with infinite data structures; see upcoming paper for an in-depth discussion. Users of the library however generally do not need to be aware of this (indeed, the sample tree abstraction is not part of the public API).

Consequences of using sample trees

Arguably all of the key differences between Hypothesis and falsify stem from the difference in representation of samples: a linear stream in Hypothesis and an infinite tree in falsify. In this section we will discuss two consequences of this choice.

Shrinking the sample tree

First, we need to decide how to shrink a sample tree. In Hypothesis, the sample stream (known as a “choice sequence”) is subjected all kinds of passes (15 and counting, according to Test-Case Reduction via Test-Case Generation: Insights from the Hypothesis Reducer), which shrink the sample stream according to lexicographical ordering; for example:

..¸ x, ..        < .., x', ..       -- shrink an element (x' < x)
.., x, y, z, ..  < .., x, z, ..     -- drop an element from the stream
.., x, y, z, ..  < .., y, z, x, ..  -- sort part of the stream (y < z < x)

When we are dealing with infinite sample trees, such a total ordering does not exist. For example, consider the following two trees:

tree1 = STree ..         tree2 = STree ..
         (STree 1 ..)              (STree 2 ..)
         (STree 4 ..)              (STree 3 ..)

Sample 1 in tree1 is less than the corresponding sample 2 in tree2, but sample 4 in tree1 is greater than the corresponding sample 3 in tree2. Hence, we have neither tree1 < tree2 nor tree2 < tree1: these two trees are incomparable. Instead, falsify works with a partial ordering; instead of the multitude of shrinking passes of Hypothesis, falsify has precisely one pass¹: shrink an individual sample in the tree.

Distributing samples to parsers

When we have a stream of values that we need to use for multiple parsers, we need to decide which samples go to which parser. In Hypothesis, this essentially happens on a first-come-first-served basis: any samples left unused by the first parser will be used by the next. As discussed, falsify parsers do not return “samples left unused”. Instead, the sample tree is split each time we compose parsers, like we did in both, shown above. In practice, this happens primarily when using applicative <*> or monadic >>=.

Predictability

These two differences are rather technical in nature; how do they affect users? Suppose we have a generator that produces a list and then a number:

listThenNum :: Gen ([Bool], Int)
listThenNum = do
    xs <- Gen.list ..
    n  <- Gen.int  ..
    return (xs, n)

If we are using a stream of samples, Hypothesis style, and then drop a random sample from that stream, the generator for int might suddenly be run against an entirely different sample; it might increase in value! Similarly, if we run that int generator against the first sample left over by the list generator, and if that list generator uses fewer samples as it shrinks, we might also run int against an unrelated sample, and its value might again increase.

This is not necessarily a problem; after all, we can then start to decrease that new int value again. However, that is only possible if the generated value with the larger int is still a counter-example to whatever property is being tested. If that is not the case, then we might not be able to shrink the list, and we might end up with a non-minimal counter-example. That can make debugging more difficult (we haven’t gotten rid of all the “junk”), and it can be difficult for users to understand why this might not shrink any further; even if the library offers facilities for showing why shrinking stopped (for example, showing which shrunk examples were rejected; verbose mode in falsify), it can still be quite puzzling why the library is trying to increase a value during shrinking.

Neither of these problems can arise in falsify: it never drops samples at all (instead, only shrinking individual samples), and since monadic bind splits the sample tree, we are guaranteed that the behaviour of int is entirely unaffected by the behaviour of list. This makes the shrinking behaviour in falsify more predictable and easier to understand.²

Monadic bind

We mentioned above that QuickCheck’s approach to shrinking has its problems, without going into detail about what those problems are. Instead, we referred to the blogpost Integrated versus Manual Shrinking; this blog post discusses not only the problems in QuickCheck, but also shows one alternative approach, known as integrated shrinking, used by QuviQ QuickCheck and made popular in the Haskell world by the library hedgehog.

The problem with integrated shrinking is that it does not work across monadic bind. The linked blogpost explains this in great detail, but the essence of the problem is not hard to see. Consider the type of monadic bind:

(>>=) :: Gen a -> (a -> Gen b) -> Gen b

We cannot shrink the right hand side of (>>=) independent from the left hand side, because the right hand side is not a generator. We only have a generator once we apply the supplied function to the result of the first generator. This means that we cannot shrink these two generators independently: if, after shrinking the right hand side, we go back and then shrink the left hand side, we get an entirely different generator, and the shrinking we did previously is just wasted.

In practice what this means is that once we start shrinking the right hand side, we will never go back anymore and shrink the left hand side. In the example from the introduction we first generated a list length, and then the elements of the list:

prop_list :: Property ()
prop_list = do
    n  <- gen $ Gen.integral $ Range.between (0, 10)
    xs <- gen $ replicateM n $ Gen.int $ Range.between (0, 1)
    assert $ P.pairwise P.eq .$ ("xs", xs)

With integrated shrinking, once we start shrinking elements from the list, we will never go back anymore and shrink the list length. With internal shrinking, however, we can go back and forth across monadic bind. This is the raison d’être of internal shrinking: it doesn’t matter that we cannot shrink the two generators independently, because we are not shrinking generators! Instead, we just shrink the samples that feed into those generators.

Selective functors

It is important to understand the limitations of internal shrinking: it is certaintly not a silver bullet. For example, consider this combinator that takes two generators, flips a coin (generates a boolean, shrinking towards True), and then executes one of the two generators:

choose :: Gen a -> Gen a -> Gen a -- Suboptimal definition
choose g g' = do
    b <- Gen.bool True
    if b then g else g'

This combinator works, but it’s not optimal. Suppose the initial value of b is False, and so we use g'; and let’s suppose furthermore that we spend some time shrinking the sample tree using g'. Consider what happens if b now shrinks to True. When this happens we will now run g against the sample tree as it was left after shrinking with g'. Although we can do that, it very much depends on the specific details of g and g' whether it’s useful to do it, and we will certainly lose the predictability we discussed above.

We could try to make the two generators shrink independent from each other by simply running both of them, and using the boolean only to choose which result we want. After all, Haskell is lazy, and so this should be fine:

choose :: Gen a -> Gen a -> Gen a -- Bad definition!
choose g g' = do
    x <- g
    y <- g'
    b <- Gen.bool True
    return $ if b then x else y

While is is true that generation using this definition of choose will work just fine (and laziness ensures that we will in fact only run whatever generator is used), this combinator shrinks very poorly. The problem is that if we generate a value but they don’t use it, the (part of) the sample tree that we used to produce that value is irrelevant, and so by definition we can always replace it by the sample tree that is zero everywhere. This means that if we later want to switch to that generator, we will only be able to do so if the absolute minimum value that the generator can produce happens to work for whatever property we’re testing. This is an important lesson to remember:

Do not generate values and then discard them: such values will always shrink to their minimum. (Instead, don’t generate the value at all.)

To solve this problem, we need to make it visible to the library when we need a generator and when we do not, so that we it can avoid shrinking that part of the sample tree while the generator is not in use. Selectively omitting effects is precisely what selective applicative functors give us. A detailed discussion of this topic would take us well outside the scope of this blog post; in the remainder of this section we will discuss the basics only.

Gen is a selective functor, which means that it is an instance of Selective, which has a single method called select:

select :: Gen (Either a b) -> Gen (a -> b) -> Gen b

The intuition is that we run the first generator; if that produces Left a, we run the second generator to get a b; if the first generator produces Right b, we skip the second generator completely. Like for applicative <*> and monadic (>>=), the two generators are run against different subtrees of the sample tree, but the critical difference is that we will not try to shrink the right subtree for the second generator unless that generator is used.

If that all sounds a bit abstract, perhaps suffices to say that any selective functor supports

ifS :: Selective f => f Bool -> f a -> f a -> f a

which we can use to implement choose in a way that avoids reusing the sample tree of the first generator for the second:

choose :: Gen a -> Gen a -> Gen a
choose = ifS (bool True)

Indeed, this is precisely the definition in the falsify library itself.

Tutorial

With the background out of the way, let’s now consider how to actually use the library. Probably the easiest way to get started is to use the tasty integration. Here is a minimal template to get you started:

module Main (main) where

import Test.Tasty
import Test.Tasty.Falsify

main :: IO ()
main = defaultMain $ testGroup "MyTestSuite" [
      testProperty "myFirstProperty" prop_myFirstProperty
    ]

prop_myFirstProperty :: Property ()
prop_myFirstProperty = return ()

This depends on tasty package, as well as falsify of course. If you want, you can also use the Test.Falsify.Interactive module to experiment with falsify in ghci.

Getting started

Suppose we want to test that if we multiply a number by two, the result must be even. Here’s how we could do it:

prop_multiply2_even :: Property ()
prop_multiply2_even = do
    x <- gen $ Gen.int $ Range.withOrigin (-100, 100) 0
    unless (even (x * 2)) $ testFailed "not even"

Some observations:

Property is a monad, so the usual combinators (such as unless) for monads are available
gen runs a generator, and adds the output of the generator to the test log. (The test log is only shown when the property fails.)
Gen.int is an alias for Gen.integral, which can produce values for any Integral type. There is no analogue of QuickCheck’s Arbitrary class in falsify: like in hedgehog and in Hypothesis, every generator must be explicitly specified. For a justification of this choice, see Jacob Stanley’s excellent Lambda Jam 2017 presentation Gens N’ Roses: Appetite for Reduction (Jacob is the author of hedgehog).
The specified Range tells the generator two things: in which range to produce a value, and how to shrink that value. In our example, withOrigin takes an “origin” as explicit value (here, 0), and the generator will shrink towards that origin.
testFailed is the primitive way to make a test fail, but we shall see a better way momentarily.

Predicates

Suppose we mistakingly think we need to multiply a number by three to get a even number:

prop_multiply3_even :: Property ()
prop_multiply3_even = do
    x <- gen $ Gen.int $ Range.withOrigin (-100, 100) 0
    unless (even (x * 3)) $ testFailed "not even"

If we run this test, we will get a counter-example:

multiply3_even: FAIL
  failed after 14 shrinks
  not even
  Logs for failed test run:
  generated 1 at CallStack (from HasCallStack):
    gen, called at demo/Demo/Blogpost.hs:217:10 in main:Demo.Blogpost

This counter-example is not awful: it gives us the counter-example (1), and that counter-example is minimal. We can however do much better; the idiomatic way in falsify to test properties of values is to use a Predicate. A predicate of type

Predicate '[a, b, c, ..]

is essentially a function

a -> b -> c -> .. -> Bool

but in such a way that it can produce a meaningful message if the predicate is not satisfied. Here’s how we might use it for our example:

prop_multiply3_even_pred :: Property ()
prop_multiply3_even_pred = do
    x <- gen $ Gen.int $ Range.withOrigin (-100, 100) 0
    assert $ P.even `P.dot` P.fn ("multiply3", (* 3)) .$ ("x", x)

Some comments:

P.even, like even from the prelude. is a predicate that checks its argument is even
P.dot, like (.) from the prelude, composes a predicate with a function. In addition to the function itself, you also specify the name of the function, so that that name can be used in error messages.
[(.)][docs:falsify:dollar], like `()` from the prelude, applies a predicate to a named argument.

The use of predicates is not required, but can be very helpful indeed. For our running example, this will produce this test failure message instead:

multiply3_even_pred: FAIL
  failed after 2 successful tests and 13 shrinks
  not (even (multiply3 x))
  x          : 1
  multiply3 x: 3

Ranges, Labelling

We saw the use of withOrigin already, and earlier in this blog post we used between; a generator such as

Gen.integral $ Range.between (10, 100)

will produce a value between 10 and 100 (inclusive), shrinking towards 10; it is also possibly to flip the two bounds to shrink towards 100 instead.

The other very useful Range constructor is skewedBy. A generator such as

Gen.integral $ Range.skewedBy 5 (0, 100)

will produce values between (0, 100), like between does, but skewed towards zero; a negative skew value will instead skew towards 100 (but still shrink towards zero). As an example use case, suppose that for a certain property we need a list of Int and an Int, and sometimes that separate Int should be a member of the list, sometimes not:

prop_skew :: Double -> Property ()
prop_skew skew = do
    xs <- gen $ Gen.list rangeListLen $ Gen.integral rangeValues
    x  <- gen $ Gen.integral rangeValues
    collect "elem" [x `elem` xs]
  where
    rangeListLen, rangeValues :: Range Word
    rangeListLen = Range.between (0, 10)
    rangeValues  = Range.skewedBy skew (0, 100)

This example is a property that always passes, but we use collect to collect some statistics; specifically, in what percentage of the tests x is an element of xs. If we run this with a skew of 0, we might see something like:

100000 successful tests

Label "elem":
   94.6320% False
    5.3680% True

In only 5% of cases the element appears in the list. There are various ways in which we could change that distribution of test data, but the simplest way is simply to generate more values towards the lower end of the range; if we run the test with a skew of 5 we get

Label "elem":
   41.8710% False
   58.1290% True

Generators

Nearly all generators are built using prim as their basic building block, which returns the next sample from the sample tree. Higher-level generators split into two categories: “simple” (non-compound) generators that produce a value given some arguments, and generator combinators that take generators as input and produce new generators. Some important examples in the first category are:

integral, which we discussed already
bool, which produces a Bool, shrinking towards a choice of True or False
elem, which picks a random element from a list, and shuffle, which shuffles a list
etc.

The library also offers a number of generator combinators; here we list the most important ones:

choose we saw when we discussed Selective functors, and chooses (uniformly) between two generators, shrinking towards the first.
list takes a range for the list length and a generator and produces a list of values. Unlike the simple “pick a length and then call replicateM” approach from the example from the introduction, this generator can drop elements anywhere in the list (it does this by using the combinator mark to mark elements in the list; as the marks shrink towards “drop”, the element is removed, up to the specified Range).
frequency, similar to the like-named function in QuickCheck, takes a list of generators and frequencies, and chooses a generator according to the specified frequency. This is another way in which we can tweak the distribution of test data.

The implementation of frequency ensures that the generators can shrink indepedently from each other. This could be defined just using Selective, but for improved performance it makes use of a low-level combinator called perturb; see also bindIntegral, which generalizes Selective bindS, and has significantly better performance than bindS.

Generating functions

One of the most impressive aspects of QuickCheck is that it can generate, show and shrink functions. This is due to a functional pearl by Koen Claessen called Shrinking and showing functions; the presentation is available on YouTube and is well worth a watch. We have adapted the QuickCheck approach (and simplified it slightly) to falsify; the generator is called fun. Here is an example (Fn is a pattern synonym; you will essentially always need it when generating functions):

prop_fn1 :: Property ()
prop_fn1 = do
    Fn (f :: [Int] -> Bool) <- gen $ Gen.fun $ Gen.bool False
    assert $
         P.eq
         `P.on` P.fn ("f", f)
      .$ ("x", [1, 2, 3])
      .$ ("y", [4, 5, 6])

This property says that for any function f :: [Int] -> Bool, if we apply that function to the list [1, 2, 3] we must get the same result as when we apply it to the list [4, 5, 6]. Of course, that is not true, and when we run this test, falsify will give us a counter-example:

failed after 53 shrinks
(f x) /= (f y)
x  : [1,2,3]
y  : [4,5,6]
f x: True
f y: False

Logs for failed test run:
generated {[1,2,3]->True, _->False} at CallStack (from HasCallStack):
  gen, called at demo/Demo/Blogpost.hs:244:32 in main:Demo.Blogpost

Notice the counter-example we get: a function that returns True for the list [1, 2, 3], and False everywhere else.³ It truly is quite astonishing that this works: we can see that the list [1, 2, 3] is special by inspecting the source code, but of course falsify (or indeed, QuickCheck; this is not unique to falsify) cannot! Instead, falsify will generate a random infinitely large description of functions from [Int] -> Bool, that covers all possible input lists, and then start shrinking this description, throwing away values for inputs it doesn’t need, until a minimal test case remains. Truly a testament to the power of laziness in Haskell.

For a more realistic example, let’s port an example from Koen Claessen’s presentation to falsify. This example is testing the (wrong) property that for all functions f and predicates p,

map f . filter p == filter p . map f

In falsify, we might express this as:

prop_mapFilter :: Property ()
prop_mapFilter = do
    Fn (f :: Int -> Int)  <- gen $ Gen.fun genInt
    Fn (p :: Int -> Bool) <- gen $ Gen.fun genBool
    xs :: [Int] <- gen $ Gen.list (Range.between (0, 100)) genInt
    assert $
       P.eq
      `P.split` (P.fn ("map f", map f), P.fn ("filter p", filter p))
      `P.split` (P.fn ("filter p", filter p), P.fn ("map f", map f))
      .$ ("xs", xs)
      .$ ("xs", xs)
  where
    genInt :: Gen Int
    genInt = Gen.int $ Range.between (0, 100)

    genBool :: Gen Bool
    genBool = Gen.bool False

We generate a random function f, a random predicate p, a random list xs, and then assert the property; and of course, falsify will happily give us a counter-example:

failed after 25 shrinks
(map f (filter p xs)) /= (filter p (map f xs))
xs                 : [96]
xs                 : [96]
filter p xs        : [96]
map f xs           : [0]
map f (filter p xs): [0]
filter p (map f xs): []

Logs for failed test run:
generated {_->0} at CallStack (from HasCallStack):
  gen, called at demo/Demo/Blogpost.hs:254:30 in main:Demo.Blogpost
generated {96->True, _->False} at CallStack (from HasCallStack):
  gen, called at demo/Demo/Blogpost.hs:255:30 in main:Demo.Blogpost
generated [96] at CallStack (from HasCallStack):
  gen, called at demo/Demo/Blogpost.hs:256:20 in main:Demo.Blogpost

It generated a function that maps anything to 0, a predicate that is True for 96 and False for everything else, and a list containing only the value 96; this is indeed a nice counter-example, as the output from the assert explains.

Side note: in an ideal world that value 96 would be shrunk too. However, this would require shrinking both the 96 in the generated list and the 96 in the generated predicate at the same time. Like QuickCheck, falsify never takes more than one shrink step at once, to ensure that shrinking is O(n) and avoid exponential explosion. Section Dependencies between commands of my blog post “An in-depth look at quickcheck-state-machine” discusses this kind of problem in the context of quickcheck-state-machine.

Testing shrinking

When we use internal (or indeed, integrated) shrinking, we don’t write a separate shrinking function, but that doesn’t mean we cannot get shrinking wrong. Shrinking never truly comes for free! As a simple example, consider writing a generator that produces any value below a given maximum (essentially, a more limited form of integral). A first attempt might be:

below :: Word64 -> Gen Word64
below n = (`mod` n) <$> Gen.prim

While this generator does in fact produce values in the range 0 <= x < n, it does not shrink very well! As the value produced prim shrinks, the value produced by below will cycle. We can discover this by writing a property that tests the shrinking behaviour of below, using testShrinkingOfGen:

prop_below_shrinking :: Property ()
prop_below_shrinking = do
    n <- gen $ Gen.integral $ Range.between (1, 1_000)
    testShrinkingOfGen P.ge $ below n

This property will fail:

failed after 4 successful tests and 14 shrinks
original < shrunk
original: 0
shrunk  : 1

Logs for failed test run:
generated 2 at CallStack (from HasCallStack):
  gen, called at demo/Demo/Blogpost.hs:281:10 in main:Demo.Blogpost

In addition to testing individual shrinking steps, we can also test that for particular property and generator, we can generate a particular minimum using testMinimum. Let’s consider the naive list generator from the introduction one more time:

naiveList :: Range Int -> Gen a -> Gen [a]
naiveList r g = do
    n  <- Gen.integral r
    replicateM n g

Suppose we want to verify that if we use this generator to test the (false) property that “all elements of all lists are always equal” we should always get either [0, 1] or [1, 0] as a counter-example; after all, those are the two minimal counter-examples. We could test this as follows:

prop_naiveList_minimum :: Property ()
prop_naiveList_minimum =
    testMinimum (P.elem .$ ("expected", [[0,1], [1,0]])) $ do
      xs <- gen $ naiveList
                    (Range.between (0, 10))
                    (Gen.int (Range.between (0, 1)))
      case P.eval $ P.pairwise P.eq .$ ("xs", xs) of
        Left _   -> testFailed xs
        Right () -> return ()

The counter-example reported by falsify is (somewhat shortened):

naiveList_minimum: FAIL
  failed after 0 shrinks
  minimum `notElem` expected
  minimum : [0,0,1]
  expected: [[0,1],[1,0]]

Logs for failed test run:
generated [0,0,1] at CallStack (from HasCallStack):
  gen, called at demo/Demo/Blogpost.hs:294:13 in main:Demo.Blogpost

Logs for rejected potential next shrinks:

** Rejected run 0
generated [] at CallStack (from HasCallStack):

** Rejected run 3
generated [0] at CallStack (from HasCallStack):

** Rejected run 4
generated [0,0] at CallStack (from HasCallStack):

** Rejected run 8
generated [0,0,0] at CallStack (from HasCallStack):

This is telling us that the minimum value it produced was [0, 0, 1], instead of one of the two lists that we expected. It also tells us what shrink steps were rejected (because they weren’t counter-examples). This is informative, because _none of those shrink steps is [0, 1]: the naive list generator, unlike the real one (which does pass this property) cannot drop elements from the start of the list.

The falsify test suite uses testShrinkingOfGen (and its generalization testShrinking) as well as testMinimum extensively to test falsify’s own generators.

Compatibility

Finally, falsify offers two combinators shrinkWith and fromShrinkTree which provide compatibility with QuickCheck style shrinking

shrinkWith :: (a -> [a]) -> Gen a -> Gen a

and hedgehog style shrinking

fromShrinkTree :: Tree a -> Gen a

respectively. The implementation of these combinators depends on a minor generalization of the sample tree representation; the details are discussed in the paper.

Conclusions

Shrinking is an essential component of any approach to property based testing. In the Haskell world, two libraries offered two competing approaches to shrinking: manual shrinking offered by QuickCheck, where users are entirely responsible for writing shrinkers for their generators, and integrated shrinking, offered by hedgehog. Integrated shrinking is nice, but does not work well with monadic bind. The Python Hypothesis library taught us how we can have “internal” shrinking: like in integrated shrinking, we do not write a separate generator and shrinker, but unlike in integrated shrinking, this approach does work across monadic bind.

The Haskell falsify library takes the core ideas of Hypothesis and applies them in the context of Haskell. As we have seen, however, the actual details of how these two libraries work differ quite significantly. The falsify approach is more suitable to Haskell where we might deal with infinite data structures, provides the user with more predictable shrinking, and provides the user with tools for controlling generator independence (through the use of selective functors).

Footnotes

Admittedly, it depends a bit on how you count: we can also replace any tree in a single step by the tree that is zero everywhere, which could be considered a separat “pass”.↩︎
Hypothesis does try to avoid redistributing samples during shrinking as part of it’s “hierarchical delta debugging”, essentially recovering some kind of tree structure, but this is not under the control of the user, and does not provide any guarantees.↩︎
It can also produce {[4,5,6]->True, _->False}, depending on the seed.↩︎

falsify: Hypothesis-inspired shrinking for Haskell