An in-depth look at `quickcheck-state-machine`

Edsko de Vries – Wednesday, 23 January 2019

Stateful APIs are everywhere: file systems, databases, widget libraries, the list goes on. Automated testing of such APIs requires generating sequences of API calls, and when we find a failing test, ideally shrinking such a sequence to a minimal test case. Neither the generation nor the shrinking of such sequences is trivial. After all, it is the very nature of stateful systems that later calls may depend on earlier calls: we can only add rows to a database table after we create it, we can only write to a file after we open it, etc. Such dependencies need to be tracked carefully. Moreover, in order to verify the responses we get back from the system, the test needs to maintain some kind of internal representation of what it thinks the internal state of the system is: when we read from a file, we need to know what was in the file in order to be able to verify if the response was correct or not.

In this blog post we will take an in-depth look at quickcheck-state-machine, a library for testing stateful code. Our running example will be the development of a simple mock file system that should behave identically to a real file system. Although simple, the example will be large enough to give us an opportunity to discuss how we can verify that our generator is producing all test cases we need, and how we can inspect whether the shrinker is doing a good job; in both cases, test case labelling will turn out to be essential. Throughout we will also discuss design patterns for quickcheck-state-machine tests which improve separation of concerns and reduce duplication. It should probably be pointed out that this is an opinionated piece: there are other ways to set things up than we present here.

We will not show the full development in this blog post, and instead focus on explaining the underlying concepts. If you want to follow along, the code is available for download. We will assume version 0.6 of quickcheck-state-machine, which was recently released. If you are using an older version, it is recommended to upgrade, since the newer version includes some important bug fixes, especially in the shrinker.

Introducing the running example

Our running example will be the development of a simple mock file system; the intention is that its behaviour is identical to the real file system, within the confines of what it needs to support. We will represent the state of the file system as

data Mock = M {
    dirs  :: Set Dir
  , files :: Map File String
  , open  :: Map MHandle File
  , next  :: MHandle
  }

type MHandle = Int

emptyMock :: Mock
emptyMock = M (Set.singleton (Dir [])) Map.empty Map.empty 0

We record which directories (folders) exist, the contents of all files on the system, the currently open handles (where mock handles are just integers), and the next available mock handle. To avoid confusion between files and directories we do not use FilePath but instead use

data Dir  = Dir [String]
data File = File {dir :: Dir, name :: String}

As one example, here is the mock equivalent of readFile:

type MockOp a = Mock -> (Either Err a, Mock)

mRead :: File -> MockOp String
mRead f m@(M _ fs hs _)
  | alreadyOpen               = (Left Busy         , m)
  | Just s <- Map.lookup f fs = (Right s           , m)
  | otherwise                 = (Left DoesNotExist , m)
  where
    alreadyOpen = f `List.elem` Map.elems hs

We first check if there is an open handle to the file; if so, we disallow reading this file (“resource busy”); if the file exists, we return its content; otherwise we report a “does not exist” error. The implementation of the other mock functions is similar; the full API is

mMkDir :: Dir               -> MockOp ()
mOpen  :: File              -> MockOp MHandle
mWrite :: MHandle -> String -> MockOp ()
mClose :: MHandle           -> MockOp ()
mRead  :: File              -> MockOp String

Finally, we should briefly talk about errors; the errors that the mock file system can report are given by

data Err = AlreadyExists | DoesNotExist | HandleClosed | Busy

and they capture a subset of the IO exceptions¹

fromIOError :: IOError -> Maybe Err
fromIOError e =
    case ioeGetErrorType e of
      GHC.AlreadyExists    -> Just AlreadyExists
      GHC.NoSuchThing      -> Just DoesNotExist
      GHC.ResourceBusy     -> Just Busy
      GHC.IllegalOperation -> Just HandleClosed
      _otherwise           -> Nothing

Testing

Typically we are developing some stateful code, and then write a pure (mock) implementation of the same thing to test it, making sure that the stateful implementation and the simpler pure model compute the same things. Here we are doing the opposite: we are adjusting the model (the mock file system) to match what the real file system does. Either way, the process is the same: we write tests, execute them both in the real thing and in the model, and compare results.

If we were writing unit tests, we might write tests such as

Write to two different files
Write to a file and then read it
etc.

However, as John Hughes of QuviQ – and one of the original authors of QuickCheck – likes to point out, as systems grow, it becomes impossible to write unit tests that test the interaction between all the features of the system. So, don’t write unit tests. Instead, generate tests, and verify properties.

To generate tests for our mock file system, we have to generate sequences of calls into the API; “open this file, open that file, write to the first file we opened, …”. We then execute this sequence both against the mock file system and against the real thing, and compare results. If something goes wrong, we end up with a test case that we can inspect. Ideally, we should then try to reduce this test to try and construct a minimal test case that illustrates the bug. We have to be careful when shrinking: for example, when we remove a call to open from a test case, then any subsequent writes that used that file handle must also be removed. A library such as quickcheck-state-machine can be used both to help with generating such sequences and, importantly, with shrinking them.

Reifying the API

It is important that we generate the test before executing it. In other words, the test generation should not depend on any values that we only get when we run the test. Such a dependency makes it impossible to re-run the same test multiple times (no reproducible test cases) or shrink tests to obtain minimal examples. In order to do this, we need to reify the API: we need to define a data type whose constructors correspond to the API calls:

data Cmd h =
    MkDir Dir
  | Open File
  | Write h String
  | Close h
  | Read File

Cmd is polymorphic in the type of handles h; this is important, because we should be able to execute commands both against the mock file system and against the real file system:

runMock ::       Cmd MHandle -> Mock -> (.. Mock)
runIO   :: .. -> Cmd IO.Handle -> IO ..

What should the return type of these functions be? After all, different functions return different things: Open returns a new handle, Read returns a string, the other functions return unit. To solve this problem we will simply introduce a union type² for successful responses

data Success h = Unit () | Handle h | String String

A response is then either a succesful response or an error:

newtype Resp h = Resp (Either Err (Success h))

It is now easy to implement runMock: we just map all the constructors in Cmd to the corresponding API calls, and wrap the result in the appropriate constructor of Success:

runMock :: Cmd MHandle -> Mock -> (Resp MHandle, Mock)
runMock (MkDir d)   = first (Resp . fmap Unit)   . mMkDir d
runMock (Open  f)   = first (Resp . fmap Handle) . mOpen  f
runMock (Write h s) = first (Resp . fmap Unit)   . mWrite h s
runMock (Close h)   = first (Resp . fmap Unit)   . mClose h
runMock (Read  f)   = first (Resp . fmap String) . mRead  f

where first :: (a -> b) -> (a, x) -> (b, x) comes from Data.Bifunctor.

We can write a similar interpreter for IO; it will take a FilePath as an additional argument that it will use as a prefix for all paths; we will use this to run the IO test in some temporary directory.

runIO :: FilePath -> Cmd IO.Handle -> IO (Resp IO.Handle)

References

Our interpreter for IO takes real IO handles as argument; but will not have any real handles until we actually run the test. We need a way to generate commands that run in IO but don’t use real handles (yet). Here is where we see the first bit of infrastructure provided by quickcheck-state-machine, references:

data Reference a r = Reference (r a)

where we will instantiate that r parameter to either Symbolic or Concrete:³

data Symbolic a = Symbolic Var
data Concrete a = Concrete a

In other words, a Reference a Concrete is really just a wrapper around an a; indeed, quickcheck-state-machine provides

reference :: a -> Reference a Concrete
concrete  :: Reference a Concrete -> a

However, a Reference a Symbolic is a variable:

newtype Var = Var Int

An example of a program using symbolic references is

openThenWrite :: [Cmd (Reference IO.Handle Symbolic)]
openThenWrite = [
      Open (File (Dir []) "a")
    , Open (File (Dir []) "b")
    , Write (Reference (Symbolic (Var 0))) "Hi"
    ]

This program corresponds precisely to our example from earlier: “open this file, open that file, then write to the first file we opened”. Commands can return as many symbolic references in their result values as they want⁴; in our simple example, only Open creates a new reference, and so Var 0 returns to the handle returned by the first call to Open.

When we execute the test, those variables will be instantiated to their real values, turning symbolic references into concrete references. We will of course not write programs with symbolic references in them by hand; as we will see later, quickcheck-state-machine provides infrastructure for doing so.

Since we will frequently need to instantiate Cmd and Resp with references to handles, we will introduce some special syntax for this:

newtype At f r = At (f (Reference IO.Handle r))
type    f :@ r = At f r

For example, here is a wrapper around runIO that we will need that executes a command with concrete references:

semantics :: FilePath -> Cmd :@ Concrete -> IO (Resp :@ Concrete)
semantics root (At c) = (At . fmap reference) <$>
                          runIO root (concrete <$> c)

This is really just a call to runIO, with some type wrapping and unwrapping.

Relating the two implementations

When we run our tests, we will execute the same set of commands against the mock implementation and in real IO, and compare the responses we get after each command. In order to compare, say, a command “write to this MHandle” against the mock file system to a command “write to this IOHandle” in IO, we need to know the relation between the mock handles and the IO handles. As it turns out, the most convenient way to store this mapping is as a mapping from references to the IO handles (either concrete or symbolic) to the corresponding mock handles.

type HandleRefs r = [(Reference IO.Handle r, MHandle)]

(!) :: Eq k => [(k, a)] -> k -> a
env ! r = fromJust (lookup r env)

Then to compare the responses from the mock file system to the responses from IO we need to keep track of the state of the mock file system and this mapping; we will refer to this as the model for our test:

data Model r = Model Mock (HandleRefs r)

initModel :: Model r
initModel = Model emptyMock []

The model must be polymorphic in r: during test generation we will instantiate r to Symbolic, and during test execution we will instantiate r to Concrete.

Stepping the model

We want to work towards a function

transition :: Eq1 r => Model r -> Cmd :@ r -> Resp :@ r -> Model r

to step the model; we will gradually build up towards this. First, we can use the model to translate from commands or responses in terms of references to the corresponding commands or responses against the mock file system:

toMock :: (Functor f, Eq1 r) => Model r -> f :@ r -> f MHandle
toMock (Model _ hs) (At fr) = (hs !) <$> fr

Specifically, this can be instantiated to

toMock :: Eq1 r => Model r -> Cmd :@ r -> Cmd MHandle

which means that if we have a command in terms of references, we can translate that command to the corresponding command for the mock file system and execute it:

step :: Eq1 r => Model r -> Cmd :@ r -> (Resp MHandle, Mock)
step m@(Model mock _) c = runMock (toMock m c) mock

In order to construct the full new model however we also need to know how to extend the handle mapping. We can compute this by comparing the response we get from the “real” semantics (Resp :@ r) to the response we get from the mock semantics (from step), and simply zip the handles from both responses together to obtain the new mapping. We wrap all this up into an event:

data Event r = Event {
    before   :: Model  r
  , cmd      :: Cmd :@ r
  , after    :: Model  r
  , mockResp :: Resp MHandle
  }

and we construct an event from a model, the command we executed, and the response we got from the real implementation:

lockstep :: Eq1 r
         => Model   r
         -> Cmd  :@ r
         -> Resp :@ r
         -> Event   r
lockstep m@(Model _ hs) c (At resp) = Event {
      before   = m
    , cmd      = c
    , after    = Model mock' (hs <> hs')
    , mockResp = resp'
    }
  where
    (resp', mock') = step m c
    hs' = zip (toList resp) (toList resp')

The function we mentioned at the start of this section is now easily derived:

transition :: Eq1 r => Model r -> Cmd :@ r -> Resp :@ r -> Model r
transition m c = after . lockstep m c

as well as a function that compares the mock response and the response from the real file system and checks that they are the same:

postcondition :: Model   Concrete
              -> Cmd  :@ Concrete
              -> Resp :@ Concrete
              -> Logic
postcondition m c r = toMock (after e) r .== mockResp e
  where
    e = lockstep m c r

We will pass this function to quickcheck-state-machine to be run after every command it executes to make sure that the model and the real system do indeed return the same responses; it therefore does not need to be polymorphic in r. (Logic is a type introduced by quickcheck-state-machine; think of it as a boolean with some additional information, somewhat similar to QuickCheck’s Property type.)

Events will also be very useful when we label our tests; more on that later.

Constructing symbolic responses

We mentioned above that we will not write programs with symbolic references in it by hand. Instead what will happen is that we execute commands in the mock file system, and then replace any of the generated handles by new variables. Most of this happens behind the scenes by quickcheck-state-machine, but we do need to give it this function to construct symbolic responses:

symbolicResp :: Model Symbolic
             -> Cmd :@ Symbolic
             -> GenSym (Resp :@ Symbolic)
symbolicResp m c = At <$> traverse (const genSym) resp
  where
    (resp, _mock') = step m c

This function does what we just described: we use step to execute the command in the mock model, and then traverse the response, constructing a new (fresh) variable for each handle. GenSym is a monad defined in quickcheck-state-machine for the sole purpose of generating variables; we won’t use it anywhere else except in this function.

Generating commands

To generate commands, quickcheck-state-machine requires a function that produces the next command given the current model; this function will be a standard QuickCheck generator. For our running example, the generator is easy to write:

generator :: Model Symbolic -> Maybe (Gen (Cmd :@ Symbolic))
generator (Model _ hs) = Just $ QC.oneof $ concat [
      withoutHandle
    , if null hs then [] else withHandle
    ]
  where
    withoutHandle :: [Gen (Cmd :@ Symbolic)]
    withoutHandle = [
          fmap At $ MkDir <$> genDir
        , fmap At $ Open  <$> genFile
        , fmap At $ Read  <$> genFile
        ]

    withHandle :: [Gen (Cmd :@ Symbolic)]
    withHandle = [
          fmap At $ Write <$> genHandle <*> genString
        , fmap At $ Close <$> genHandle
        ]

    genDir :: Gen Dir
    genDir = do
        n <- QC.choose (0, 3)
        Dir <$> replicateM n (QC.elements ["x", "y", "z"])

    genFile :: Gen File
    genFile = File <$> genDir <*> QC.elements ["a", "b", "c"]

    genString :: Gen String
    genString = QC.sized $ \n -> replicateM n (QC.elements "ABC")

    genHandle :: Gen (Reference IO.Handle Symbolic)
    genHandle = QC.elements (map fst hs)

A few comments on the generator:

When we generate paths, we choose from a very small set of directory and file names. We are not really interested in testing that, for example, our implementation is Unicode-safe here; by picking from a small known set we generate tests that are more likely to contain multiple references to the same file, without having to add special provisions to do so.
We cannot, of course, generate handles out of nowhere; fortunately, the model tells us which handles we have available, and so genHandle just picks one at random from that. We do not limit the generator to picking open handles: it is important to also test the behaviour of the system in the error case where a program attempts to write to a closed handle.
The elements function from QuickCheck, which picks a random element from a list, is partial: it will throw an error if the list is empty. We must be careful not to generate any commands requiring handles when we have no handles yet.
The reason for the Maybe in the type signature is that in some circumstances a generator might be unable to generate any commands at all given a particular model state. We don’t need to take advantage of this feature and therefore always return Just a generator.

We can also define a shrinker for commands, but we can start with a shrinker that says that individual commands cannot be shrunk:

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ _ = []

Putting it all together

In order to actually start generating and running tests, quickcheck-state-machine wants us to bundle all of this functionality up in a single record:

sm :: FilePath -> StateMachine Model (At Cmd) IO (At Resp)
sm root = StateMachine {
      initModel     = initModel
    , transition    = transition
    , precondition  = precondition
    , postcondition = postcondition
    , invariant     = Nothing
    , generator     = generator
    , distribution  = Nothing
    , shrinker      = shrinker
    , semantics     = semantics root
    , mock          = symbolicResp
    }

We have defined all of these functions above, with the exception of precondition. When quickcheck-state-machine finds a failing test, it will try to shrink it to produce a smaller test case. It does this by trying to remove commands from the program and then checking if the resulting program still “makes sense”: does the precondition of all commands still hold?

The precondition should be as liberal as possible; for instance, the precondition should not require that a file handle is open before writing to it, because we should also check that errors are handled correctly. Typically, the only thing the precondition should verify is that commands do not refer to non-existent handles (i.e., if we remove a call to open, then subsequent uses of the handle returned by that call to open simply cannot be executed anymore). Thus, we will define:

precondition :: Model Symbolic -> Cmd :@ Symbolic -> Logic
precondition (Model _ hs) (At c) =
    forall (toList c) (`elem` map fst hs)

Aside. Although we do not need it for our running example, one other thing a precondition might do is rule out test cases that we explicitly don’t want to test. For example, if our mock file system throws an error in some cases because a particular combination of arguments to a call is simply not supported, then we don’t want to flag this as a bug. A clean way to implement this is to extend the error type with a field that marks an error as an intentional limitation; then the precondition can simply rule out any commands that (in the model) would return an error with this mark. This keeps the definition of the precondition itself simple, and the logic of what is and what isn’t an intentional limitation lives in the implementation of the model itself.

Running the tests

Apart from some additional type class instances (see the code), we are now ready to define and execute the actual test:

prop_sequential :: FilePath -> QC.Property
prop_sequential tmpDir =
    forAllCommands (sm rootUnused) Nothing $ \cmds ->
      QC.monadicIO $ do
        tstTmpDir <- liftIO $ createTempDirectory tmpDir "QSM"
        let sm' = sm tstTmpDir
        (hist, _model, res) <- runCommands sm' cmds
        prettyCommands sm' hist
          $ checkCommandNames cmds
          $ res QC.=== Ok

rootUnused :: FilePath
rootUnused = error "root not used during command generation"

All functions prefixed QC are standard QuickCheck functions. The others come from quickcheck-state-machine:

forAllCommands uses QuickCheck’s forAllShrink and instantiates it will the command generation and shrinking infrastructure from quickcheck-state-machine
runCommands then executes the generated commands, validating the postcondition at every step.
prettyCommands renders those commands in case of a test failure.
checkCommandNames adds some statistics about the distribution of commands in the generated tests.

We can run the test in ghci:

> quickCheck (prop_sequential "./tmp")
+++ OK, passed 100 tests:
56% coverage

 3% [("MkDir",1),("Read",1)]
 3% []
 2% [("MkDir",1),("Open",1),("Read",1)]
 2% [("MkDir",1)]
 1% [("Close",1),("MkDir",1),("Open",5),("Read",4),("Write",2)]
...

It tells us that all tests passed, and gives us some statistics about the tests that were executed: in 3% of the cases, they contained a single MkDir and a single Read, 3% were completely empty, 2% contained one call to MkDir, one call to Open, one call to Read, and so on.

Labelling

At this point you might conclude that you’re done. We have the real implementation, we have the mock implementation, they return identical results for all tests, what else could we want?

Let’s think back to those unit tests we were on the verge of writing, but stopped just in time because we remembered that we should generate unit tests, not write them:

Write to two different files
Write to a file and then read it

How do we know that our generated tests include these two cases (and all the other unit tests that we would have written)? We get some statistics from quickcheck-state-machine, but it’s not terribly helpful. For example, the first line above tells us that 3% of our test cases contain one call to MkDir and one call to Read; but we know that that call to Read must fail, because if these are the only two commands we execute, there aren’t any files for that Read to read.

The solution is to label tests. For example, we might introduce labels, or tags, that correspond to the two unit tests above:

data Tag = OpenTwo | SuccessfulRead

We then need to write a function that looks at a particular test case and checks which labels apply. It is important to realize that this does not mean we are bringing back the same unit tests under a different guise: programs that will be labelled OpenTwo must write to two different files, but may also do a whole bunch of other things.

We can use use the foldl package to write such a labelling function in a convenient manner. The labelling function will look at the Events and produce Tags. To check if we open (at least) two different files, we keep track of all the successful calls to open; for SucessfulRead we simply remember if we have seen a call to Read with a non-error result:

tag :: [Event Symbolic] -> [Tag]
tag = Foldl.fold $ catMaybes <$> sequenceA [
      openTwo
    , successfulRead
    ]
  where
    openTwo :: Fold (Event Symbolic) (Maybe Tag)
    openTwo = Fold update Set.empty extract
      where
        update :: Set File -> Event Symbolic -> Set File
        update opened ev =
            case (cmd ev, mockResp ev) of
              (At (Open f), Resp (Right _)) ->
                Set.insert f opened
              _otherwise ->
                opened

        extract :: Set File -> Maybe Tag
        extract opened = do
            guard (Set.size opened >= 2)
            return $ OpenTwo

    successfulRead :: Fold (Event Symbolic) (Maybe Tag)
    successfulRead = Fold update False extract
      where
        update :: Bool -> Event Symbolic -> Bool
        update didRead ev = didRead ||
            case (cmd ev, mockResp ev) of
              (At (Read _), Resp (Right _)) ->
                True
              _otherwise ->
                False

        extract :: Bool -> Maybe Tag
        extract didRead = do
            guard didRead
            return SuccessfulRead

(For a read to be successful, we must have created the file first – this is a property of the semantics, we don’t need to enforce this in the labeller.)

The commands we get back from the forAllCommands function of quickcheck-state-machine are of type Commands. This is a simple wrapper around a list of Command; Command in turn bundles a command (Cmd) along with its response (and the variables in that response). We can therefore easily turn this into a list of events:

execCmd :: Model Symbolic
        -> Command (At Cmd) (At Resp)
        -> Event Symbolic
execCmd model (Command cmd resp _vars) =
    lockstep model cmd resp

execCmds :: Commands (At Cmd) (At Resp) -> [Event Symbolic]
execCmds = \(Commands cs) -> go initModel cs
  where
    go :: Model Symbolic
       -> [Command (At Cmd) (At Resp)]
       -> [Event Symbolic]
    go _ []       = []
    go m (c : cs) = e : go (after e) cs
      where
        e = execCmd m c

We can then define a variant on prop_sequential above that replaces checkCommandNames with our own labelling function (you could also use both, if you wanted to):

prop_sequential' :: FilePath -> QC.Property
prop_sequential' tmpDir =
    forAllCommands (sm rootUnused) Nothing $ \cmds ->
      QC.monadicIO $ do
        tstTmpDir <- liftIO $ createTempDirectory tmpDir "QSM"
        let sm' = sm tstTmpDir
        (hist, _model, res) <- runCommands sm' cmds
        prettyCommands sm' hist
          $ QC.tabulate "Tags" (map show $ tag (execCmds cmds))
          $ res QC.=== Ok

Here tabulate is the standard QuickCheck function for adding multiple labels to a test case. If we now re-run our tests, we get

> quickCheck (prop_sequential' "./tmp")
+++ OK, passed 100 tests.

Tags (57 in total):
63% OpenTwo
37% SuccessfulRead

The numbers here are a bit awkward to interpret: it says that across all tests a total of 57 labels were found, and of those 57 labels, 63% were OpenTwo and 37% were SuccessfulRead. In other words, 63% * 57 = 36 tags were OpenTwo (36 tests out of 100 were labelled as OpenTwo), and 37% * 57 = 21 tags were SuccessfulRead (21 tests out of 100 were labelled as SuccessfulRead). Note that it is perfectly possible for these two sets of tests to overlap (i.e., a single program can be tagged as both OpenTwo and SuccessfulRead).

Inspecting the labelling function

So we have a generator, and we have labels for the unit tests that we didn’t write; have we covered our bases now? Well, how do we know that our labelling function is correct? We might seem to descent into infinite regress here, testing the tests that tests the tests… but it would be good to at least have a look at some test case examples and how they are labelled. Fortunately, this is functionality that QuickCheck provides out of the box through a function called labelledExamplesWith. We can define a simple wrapper around it that gives us the possibility to specify a particular seed (which will be useful once we start working on shrinking):

showLabelledExamples :: Maybe Int -> IO ()
showLabelledExamples mReplay = do
    replaySeed <- case mReplay of
                    Nothing   -> getStdRandom $ randomR (1, 999999)
                    Just seed -> return seed

    putStrLn $ "Using replaySeed " ++ show replaySeed

    let args = QC.stdArgs {
            QC.maxSuccess = 10000
          , QC.replay     = Just (QC.mkQCGen replaySeed, 0)
          }

    QC.labelledExamplesWith args $
      forAllCommands (sm rootUnused) Nothing $ \cmds ->
        repeatedly QC.collect (tag . execCmds $ cmds) $
          QC.property True

Instead of tabulate we must use collect to label examples when constructing examples; collect takes a single label at a time though, so we use my all-time favourite combinator to repeatedly call collect for all tags:

repeatedly :: (a -> b -> b) -> ([a] -> b -> b)
repeatedly = flip . List.foldl' . flip

(I wonder if I can patent this function?)

When we run this, we see something like

> showLabelledExamples Nothing
Using replaySeed 288187
*** Found example of OpenTwo
Commands [
    .. Open (File (Dir []) "b")
  , .. Open (File (Dir []) "a")
  ]

*** Found example of SuccessfulRead
Commands [
    .. Open (File (Dir []) "b")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Read (File (Dir []) "b")
  ]

(I’ve tidied up the output a bit and removed some redundant information.)

This is looking good: both of these look like the kind of examples we would have written ourselves for these tags.

Standard shrinking

In fact, if we look at those test cases at the end of the previous section, you might be thinking that those examples look surprisingly good: not only are they indeed instances of those tags, but they are very small test cases: they are pretty much the unit tests that we would have written if we were so misguided as to think we need to write unit tests. Were we simply lucky?

No, we were not. QuickCheck’s labelledExamples not only searches for labelled test cases, it also tries to shrink them when it finds one, and continues to shrink them until it can shrink them no further without losing the label. The shrinker itself is implemented in quickcheck-state-machine; if we disable it altogether and re-run the search for labelled examples, we might find an example such as the following for SuccessfulRead, where for clarity I’ve marked all failing commands with an asterisk (*):

Commands [
    .. Open (File (Dir ["z", "x", "y"]) "c")    (*)
  , .. MkDir (Dir ["z", "x")                    (*)
  , .. Read (File (Dir ["z", "y", "y"]) "c")    (*)
  , .. Read (File (Dir []) "c")                 (*)
  , .. MkDir (Dir [ "x" , "x" ])                (*)
  , .. Read (File (Dir ["x", "z", "x"]) "b")    (*)
  , .. MkDir (Dir ["y"])
  , .. MkDir (Dir [])                           (*)
  , .. Open (File (Dir ["z"]) "b")              (*)
  , .. Open (File (Dir []) "a")
  , .. MkDir (Dir [])                           (*)
  , .. Open (File (Dir ["x"]) "b")              (*)
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Open (File (Dir ["x", "y"]) "b")         (*)
  , .. Open (File (Dir []) "b")
  , .. MkDir (Dir ["y"])                        (*)
  , .. Read (File (Dir []) "a")
  , .. MkDir (Dir ["z"])
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Open (File (Dir ["z" , "z" , "y"]) "a")  (*)
  , .. Open (File (Dir ["x" , "x" , "z"]) "c")  (*)
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Open (File (Dir ["y", "y"]) "b")         (*)
  , .. Read (File (Dir ["x", "y", "x"]) "a")    (*)
  ]

This is looking significantly less ideal! If there was a bug in read, then this would certainly not be a very good minimal test case, and not something you would want to debug. So how does quickcheck-state-machine shrink tests? The basic approach is quite simple: it simply removes commands from the program. If the resulting program contains commands whose precondition is not satisfied (remember, for our running example this means that those commands would use handles that are no longer created) then it discards it as a possible shrink candidate; otherwise, it reruns the test, and if it still fails, repeats the process.

The large percentage of commands that are unsuccessful can easily be removed by quickcheck-state-machine:

Commands [
    .. MkDir (Dir ["y"])
  , .. Open (File (Dir []) "a")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Open (File (Dir []) "b")
  , .. Read (File (Dir []) "a")
  , .. MkDir (Dir ["z"])
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Close (Reference (Symbolic (Var 0)))
  ]

Both calls to MkDir can easily be removed, and the resulting program would still be tagged as SuccessfulRead; and the same is true for repeated closing of the same handle:

Commands [
    .. Open (File (Dir []) "a")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Open (File (Dir []) "b")
  , .. Read (File (Dir []) "a")
  , .. Close (Reference (Symbolic (Var 1)))
  ]

At this point the shrinker cannot remove the second call to Open because the second call to Close depends on it, but it can first remove that second call to Close and then remove that second call to Open, and we end up with the minimal test case that we saw in the previous section:

Commands [
    .. Open (File (Dir []) "a")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Read (File (Dir []) "a")
  ]

Perfect.

Improving shrinking

Unfortunately, this does not mean that we can depend on quickcheck-state-machine to solve all our shrinking problems. Consider this run of showLabelledExamples:

> showLabelledExamples (Just 166205)
Using replaySeed 166205
*** Found example of OpenTwo
Commands [
    .. MkDir (Dir ["x"])
  , .. Open (File (Dir [])    "c")
  , .. Open (File (Dir ["x"]) "a")
  ]

This is indeed an example of a program in which we open at least two files; however, why is that call to MkDir still there? After all, if there was a bug in opening more than one file, and the “minimal” test case would include a call to MkDir, that would be a red herring which might send the person debugging the problem down the wrong path.

The reason that quickcheck-state-machine did not remove the call to MkDir, of course, is that without it the second call to Open would fail: it tries to create a file in directory x, and if that directory does not exist, it would fail. To fix this, we need to tell quickcheck-state-machine how to shrink individual commands; so far we have been using

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ _ = []

which says that individual commands cannot be shrunk at all. So how might we shrink an Open call? One idea might be to shrink the directory, replacing for instance /x/y/a by /x/a or indeed just /a. We can implement this using

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ (At cmd) =
    case cmd of
      Open (File (Dir d) f) ->
        [At $ Open (File (Dir d') f) | d' <- QC.shrink d]
      _otherwise ->
        []

If we use this shrinker and run showLabelledExamples a number of times, we will find that all the examples of OpenTwo are now indeed minimal… until it finds an example that isn’t:

> showLabelledExamples (Just 980402)
Using replaySeed 980402
*** Found example of OpenTwo
Commands [
    .. MkDir (Dir ["x"]))
  , .. Open (File (Dir [])    "a")
  , .. Open (File (Dir ["x"]) "a")
  ]

In this example we cannot shrink the directory to [] because the resulting program would try to open the same file twice, which is not allowed (“resource busy”). We need a better way to shrink this program.

What we want to implement is “try to replace the file path by some file in the (local) root. It is important to realize however that a shrinker such as

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ (At cmd) =
    case cmd of
      Open _ -> -- BAD EXAMPLE
        [At $ Open (File (Dir []) f') | f' <- ["t1", "t2"]]
      _otherwise ->
        []

is a bad idea. This shrinker tries to replace any file path with either /t1 or /t2. However, this means that shrinking now never stops:

Open (File (Dir [] "t1"))

can be shrunk to

Open (File (Dir [] "t2"))

which can then be shrunk back to

Open (File (Dir [] "t1"))

and QuickCheck will loop when trying to shrink the test case. It is important that there is a clear direction to shrinking.

An approach that does work is the following: any file path that doesn’t start with a t can be replaced by path /t100; moreover, any /tN (for some number N) can be replaced by /tN' for some number N’ < N:

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker _ (At cmd) =
    case cmd of
      Open (File (Dir []) ('t' : n)) ->
        [openTemp n' | n' <- QC.shrink (read n)]
      Open _ ->
        [openTemp 100]
      _otherwise ->
        []
  where
    openTemp :: Int -> Cmd :@ Symbolic
    openTemp n = At $ Open (File (Dir []) ('t' : show n))

Now

Commands [
    .. MkDir (Dir ["x"]))
  , .. Open (File (Dir [])    "a")
  , .. Open (File (Dir ["x"]) "a")
  ]

can shrink to

Commands [
    .. MkDir (Dir ["x"]))
  , .. Open (File (Dir []) "a")
  , .. Open (File (Dir []) "t100")
  ]

at which point the call to MkDir can be removed; this will eventually shrink down to

Commands [
    .. Open (File (Dir []) "t0")
  , .. Open (File (Dir []) "t1")
  ]

Dependencies between commands

It’s been a long road, but we are almost there. The last thing we need to discuss is how to shrink programs with dependencies. The “open at least two” example above was relatively easy to shrink because we could shrink one command at the time. Sometimes however there are dependencies between commands. For example, consider this “minimal” example of “successful read”:

> showLabelledExamples (Just 617213)
Using replaySeed 617213
*** Found example of SuccessfulRead
Commands [
    .. MkDir (Dir [ "z" ])
  , .. Open (File (Dir ["z"]) "b")
  , .. Close (Reference (Symbolic (Var 0)))
  , .. Read (File (Dir ["z"]) "b")
  ]

As before, we have a call to MkDir which should ideally be removed. However, if we tried to change the path in the call to Open, the call to Read would fail: both of these commands should refer to the same path. But shrinking can only change one command at a time, and this is important to keep the computational complexity (runtime cost) of shrinking down. What to do?

The problem is that we want that call to Read to use the same path as the call to Open, but we have no way to express this. The solution is to make this expressible. After all, we can already express “the handle returned by that call to Open”; all we need to do is introduce a second kind of reference: a reference to a path.

The language of commands changes to

data Cmd fp h = Read (Expr fp) | -- rest as before

Cmd gets an additional type argument fp to record the types of paths, and instead of a File, Read now takes an Expr as argument:

data Expr fp = Val File | Var fp

We can of course still use a concrete path, as before, but we can also use a variable: “use the same path as used in that call to open”. This means that Open must return that reference, so Success and Resp get adjusted accordingly:

data Success fp h = Handle fp h | -- rest as before
newtype Resp fp h = Resp (Either Err (Success fp h))

Just like was the case for handles, when we actually run code all variables have been resolved, so the interpreter isn’t any more difficult:

runMock :: Cmd File MHandle -> Mock -> (Resp File MHandle, Mock)
runMock (Open f) = first (Resp . fmap (Handle f)) . mOpen f
runMock (Read e) = first (Resp . fmap String)     . mRead (eval e)
-- rest as before

eval :: Expr File -> File
eval (Val f) = f
eval (Var f) = f

The IO interpreter is modified similarly. Most of the rest of the changes to the code are minor and mostly automatic. For example, At must now instantiate both parameters

newtype At f r = At (f (Reference File r) (Reference IO.Handle r))

the model must record the mapping from file references now too

type FileRefs r = [(Reference File r, File)]
data Model    r = Model Mock (FileRefs r) (HandleRefs r)

See Version2.hs in the example package for details.

Crucially, we can now take advantage of this in the shrinker: when we see a call to Read with a file path that we have seen before, we can shrink that to use a variable instead:

shrinker :: Model Symbolic -> Cmd :@ Symbolic -> [Cmd :@ Symbolic]
shrinker (Model _ fs _) (At cmd) =
    case cmd of
      Read (Val f) ->
        [At $ Read (Var r) | r <- mapMaybe (matches f) fs]
      -- other cases as before
  where
    matches :: File -> (r, File) -> Maybe r
    matches f (r, f') | f == f'   = Just r
                      | otherwise = Nothing

This means that the problematic example⁵

> showLabelledExamples (Just 617213)
Using replaySeed 617213
*** Found example of SuccessfulRead
Commands [
    .. MkDir (Dir [ "z" ])
  , .. Open (File (Dir ["z"]) "b")
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Read (Val (File (Dir ["z"]) "b"))
  ]

can now shrink to

Commands [
    .. MkDir (Dir [ "z" ])
  , .. Open (File (Dir ["z"]) "b")
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Read (Var 0)
  ]

At this point the shrinking for Open that we defined above can kick in, replacing z/b with /t100, making the call to MkDir redundant, and the example can ultimately shrink to

Commands [
    .. Open (File (Dir []) "t0")
  , .. Close (Reference (Symbolic (Var 1)))
  , .. Read (Var 0)
  ]

Conclusions

Writing unit tests by hand is problematic for at least two reasons. First, as the system grows in complexity the number of interactions between the various features will be impossible to capture in hand written unit tests. Second, unit tests can only exercise the system in ways that the author of the unit tests foresees; often, bugs arise when the system is exercised in ways that were not foreseen. It is therefore important not to write unit tests, but rather generate them. We have seen how a library such as quickcheck-state-machine can assist in generating and shrinking meaningful sequences of API calls.

We have also seen why it is important to label test cases, how to inspect the labelling function, and how to use labelling to improve shrinking. Without writing a shrinker for individual commands, the standard quickcheck-state-machine shrinker already does a really good job at removing redundant commands. However, if we are serious about producing minimal test cases without red herrings that might lead debugging efforts astray (and we should be serious about that), then we also need to put some thought into shrinking individual commands.

Finally, we have seen how we can improve shrinking by making dependencies between commands explicit. This also serves as an example of a language with multiple types of references; the approach we put forth in this blog post essentially scales to an arbitrary number of types of references without much difficulty.

We have not talked about running parallel tests at all in this blog post. This is an interesting topic in its own right, but affects only the very outermost level of the test: prop_sequential would get an analogue prop_parallel, and nothing else would be affected; the quickcheck-state-machine documentation shows how to do this; the original paper describing the approach (in Erlang) by John Hughes and others is also well worth a read. Finally, quickcheck-state-machine is not the only library providing this functionality in Haskell; in particular, hedgehog, an alternative to QuickCheck, does also.

The way we set up the tests in this blog post is not the only way possible, but is one that we believe leads to a clean design. The running example in this blog post is a simplified version of a mock file system that we use in the new Ouroboros consensus layer, where we use it to simulate file system errors when testing a blockchain database. The tests for that blockchain database in turn also use the design patterns laid out in this blog post, and we have used those design patterns also in a test we added to quickcheck-state-machine itself to test the shrinker. Of course, being Haskellers, we would prefer to turn design patterns into actual code; indeed, we believe this to be possible, but it may require some more advanced type system features. This is something we want to investigate further.

Footnotes

Mapping from IllegalOperation to HandleClosed is bit too coarse, but suffices for the sake of this blog post.↩︎
This has a somewhat unpleasant untyped feel to it. However, if the design patterns proposed in this blog post are used, this barely affects any code that we write: we never (or almost never) have to pattern match on that Success type.↩︎
I’m eliding a Typeable constraint here.↩︎
Technically, quickcheck-state-machine uses a function Resp :@ Symbolic -> [Var] to find all the references “bound” by a command.↩︎
The Close call now refers to Var 1 instead of Var 0, because Open now creates two variables: the reference to the path and the reference to the handle.↩︎

An in-depth look at quickcheck-state-machine