# Efficient Amortised and Real-Time Queues in Haskell

Friday, 15 January 2016, by Edsko de Vries.

Filed under coding.

A queue is a datastructure that provides efficient—O(1)—operations to remove an element from the *front* of the queue and to insert an element at the *rear* of the queue. In this blog post we will discuss how we can take advantage of laziness to implement such queues in Haskell, both with amortised and with worst-case O(1) bounds.

The results in this blog post are not new, and can be found in Chris Okasaki’s book “Purely Functional Data Structures”. However, the implementation and presentation here is different from Okasaki’s. In particular, the technique we use for real-time datastructures is more explicit and should scale to datastructures other than queues more easily than Okasaki’s.

## Non-solution: Lists

To set the stage, consider this first attempt at implementing queues:

```
class Queue q where
empty :: q a
head :: q a -> a
tail :: q a -> q a
snoc :: q a -> a -> q a
data Queue0 a = Q0 [a]
instance Queue Queue0 where
empty = Q0 []
head (Q0 (x:_ )) = x
tail (Q0 (_:xs)) = Q0 xs
snoc (Q0 xs ) x = Q0 (xs ++ [x])
```

What is the complexity of `head`

and `snoc`

in this representation? Your first instinct might be to say that `head`

has O(1) complexity (after all, it doesn’t do anything but a pattern match) and that `snoc`

has O(*n*) complexity, because it needs to traverse the entire list before it can append the element.

However, Haskell is a lazy language. All that happens when we call `snoc`

is that we create a thunk (a suspended computation), which happens in O(1) time. Consider adding the elements `[1..5]`

into an empty queue, one at a time:

```
Q0 []
Q0 ([] ++ [1])
Q0 (([] ++ [1]) ++ [2])
Q0 ((([] ++ [1]) ++ [2]) ++ [3])
Q0 (((([] ++ [1]) ++ [2]) ++ [3]) ++ [4])
Q0 ((((([] ++ [1]) ++ [2]) ++ [3]) ++ [4]) ++ [5])
```

Now when we call `head`

on the resulting queue, `(++)`

needs to traverse this entire chain before it can find the first element; since that chain has O(*n*) length, the complexity of `head`

is O(*n*).

## Strict, Non-Persistent Queues

Thinking about complexity in a lazy setting can be confusing, so let’s first think about a spine strict queue. In order to define it, we will need a spine-strict list:

`data StrictList a = SNil | SCons a !(StrictList a)`

A bang annotation here means each evaluating an `SCons`

node to weak-head normal form (for instance by pattern matching on it) will also force its tail to weak head normal form, and hence the entire spine of the list; we cannot have an `SCons`

node with a pointer to an unevaluated tail.

We will also need a few operations on strict lists:

```
-- | Append two strict lists
app :: StrictList a -> StrictList a -> StrictList a
app SNil ys = ys
app (SCons x xs) ys = SCons x (app xs ys)
-- | Reverse a strict list
rev :: StrictList a -> StrictList a
rev = go SNil
where
go :: StrictList a -> StrictList a -> StrictList a
go acc SNil = acc
go acc (SCons x xs) = go (SCons x acc) xs
```

The definition of strict lists in hand, we can attempt our next queue implementation:

`data Queue1 a = Q1 !Int !(StrictList a) !Int !(StrictList a)`

Instead of using a single list, we split the queue into two parts: the *front* of the queue and the *rear* of the queue. The front of the queue will be stored in normal order, so that we can easily remove elements from the front of the queue; the rear of the queue will be stored in reverse order, so that we can also easily insert new elements at the end of the queue.

In addition, we also record the size of both lists. We will use this to enforce the following invariant:

Queue Invariant: The front of the queue cannot be shorter than the rear.

(Simpler invariants are also possible, but this invariant is the one we will need later so we will use it throughout this blogpost.)

When the invariant is violated, we restore it by moving the elements from the rear of the queue to the front; since the rear of the queue is stored in reverse order, but the front is not, the rear must be reversed:

```
inv1 :: Queue1 a -> Queue1 a
inv1 q@(Q1 f xs r ys)
| f < r = Q1 (f+r) (xs `app` rev ys) 0 SNil
| otherwise = q
```

The invariant can be violated when we shrink the front or grow the rear, so we end up with this implementation of the `Queue`

interface:

```
instance Queue Queue1 where
empty = Q1 0 SNil 0 SNil
head (Q1 _ (SCons x _ ) _ _ ) = x
tail (Q1 f (SCons _ xs) r ys) = inv1 $ Q1 (f-1) xs r ys
snoc (Q1 f xs r ys) y = inv1 $ Q1 f xs (r+1) (SCons y ys)
```

### Worst-Case versus Amortised Complexity

Since we don’t have to think about laziness, the complexity of this queue implementation is a bit easier to determine. Clearly, `head`

is O(1), and both `tail`

and `snoc`

have worst case `O(n)`

complexity because `rev`

has O(*n*) complexity. However, consider what happens when we insert [1..7] into an empty queue:

```
Q1 0 [] 0 []
Q1 1 [1] 0 [] -- invariant restored
Q1 1 [1] 1 [2]
Q1 3 [1..3] 0 [] -- invariant restored
Q1 3 [1..3] 1 [4]
Q1 3 [1..3] 2 [5,4]
Q1 3 [1..3] 3 [6,5,4]
Q1 7 [1..7] 0 [] -- invariant restored
```

Notice what happens: we only need to reverse *n* elements **after having inserted n elements**; we therefore say that the

*amortised*complexity (the complexity averaged over all operations) of the reverse is in fact O(1)—with one proviso, as we shall see in the next section.

### Amortisation versus Persistence

The analysis in the previous section conveniently overlooked one fact: since values are immutable in Haskell, nothing is stopping us from reusing a queue multiple times. For instance, if we started from

`Q1 3 [1..3] 3 [6,5,4]`

we might attempt to insert 7, then 8, then 9, and finally 10 into this (same) queue:

```
Q1 7 [1,2,3,4,5,6,7] 0 [] -- invariant restored
Q1 7 [1,2,3,4,5,6,8] 0 [] -- invariant restored
Q1 7 [1,2,3,4,5,6,9] 0 [] -- invariant restored
Q1 7 [1,2,3,4,5,6,10] 0 [] -- invariant restored
```

Notice that *each* of these single insertions incurs the full cost of a reverse. Thus, claiming an amortised O(1) complexity is only valid if we use the queue linearly (i.e., never reusing queues). If we want to lift this restriction, we need to take advantage of laziness.

## Amortised Complexity for Persistent Queues

In order to get amortised constant time bounds even when the queue is not used linearly, we need to take advantage of lazy evaluation. We will change the front of the queue back to be a lazy list:

`data Queue2 a = Q2 !Int [a] !Int !(StrictList a)`

The remainder of the implementation is the same as it was for `Queue1`

, except that reverse now needs to take a strict list as input and return a lazy list as result:

```
rev' :: StrictList a -> [a]
rev' = go []
where
go :: [a] -> StrictList a -> [a]
go acc SNil = acc
go acc (SCons x xs) = go (x:acc) xs
```

All the other changes are just changing the operations on strict lists to operations on lazy lists:

```
inv2 :: Queue2 a -> Queue2 a
inv2 q@(Q2 f xs r ys)
| f < r = Q2 (f+r) (xs ++ rev' ys) 0 SNil
| otherwise = q
instance Queue Queue2 where
empty = Q2 0 [] 0 SNil
head (Q2 _ (x:_ ) _ _ ) = x
tail (Q2 f (_:xs) r ys) = inv2 $ Q2 (f-1) xs r ys
snoc (Q2 f xs r ys) y = inv2 $ Q2 f xs (r+1) (SCons y ys)
```

The genius of this representation lies in two facts. First, notice that when we construct the thunk `(xs ++ rev' ys)`

, we know that the `rev' ys`

will not be forced until we have exhausted `xs`

. Since we construct this thunk only when the rear is one longer than the front, we are indeed justified in saying that the cost of the reverse is amortised O(1).

But what about reusing the same queue twice? This is where we rely crucially on laziness. Suppose we have a sequence of operations

```
Q2 4 [1,2,3,4] 4 [8,7,6,5] -- initial queue
Q2 9 ([1..4] ++ rev' [9,8,7,6,5]) 0 [] -- snoc (invariant restored)
Q2 5 (rev' [9,8,7,6,5]) 0 [] -- tail 4 times
```

While it is true that we might call `tail`

on this resulting queue any number of times, they will *not* each incur the full cost of `rev'`

: since these thunks will all be shared, laziness will make sure that once this `rev'`

has been evaluated (“forced”) once, it will not be forced again.

Of course, if we started from that initial queue and inserted various elements, then each of those would create a separate (not shared) thunk with a call to `rev'`

: but those calls to `rev'`

will only be forced if for each of those separate queues we first do `f`

calls to tail (in this case, 4 calls).

## From Amortised to Worst-Case Bounds

The queues from the previous section will suffice for lots of applications. However, in some applications amortised complexity bounds are not good enough. For instance, in real time systems having normally-cheap operations occassionally take a long time is not acceptable; each operation should take approximately the same amount of time, even if that means that the overall efficiency of the system is slightly lower.

There are two sources of delays in the implementation from the previous section. The first is that when we come across the call to reverse, that whole reverse needs to happen in one go. The second source comes from the fact that we might still chain calls to append; consider what happens when we insert the elements `[1..7]`

:

```
Q2 0 [] 0 []
Q2 1 r1 0 [] -- invariant restored, r1 = [] ++ rev' [1]
Q2 1 r1 1 [2]
Q2 3 r2 0 [] -- invariant restored, r2 = r1 ++ rev' [3,2]
Q2 3 r2 1 [4]
Q2 3 r2 2 [5,4]
Q2 3 r2 3 [6,5,4]
Q2 7 r3 0 [] -- invariant restored, r3 = r2 ++ rev' [7,6,5,4]
```

This is similar to the behaviour we saw for the queues based on a single list, except we now have a maximum of O(log *n*) calls rather than O(*n*), because the distance between two calls to `reverse`

doubles each time.

Intuitively, we can solve both of these problems by doing a little bit of the append and a little bit of the reverse each time we call `tail`

or `snoc`

. We need to reestablish the invariant when *r* = *f* + 1. At this point the append will take *f* steps, and the reverse *r* steps, and we will not need to reestablish the invariant again until we have added *r + f + 2* elements to the rear of the queue (or added some to the rear and removed some from the front). This therefore gives us plenty of time to do the append and the reverse, if we take one step on each call to `tail`

and `snoc`

.

### Progress

How might we “do one step of a reverse”? This is where we diverge from Okasaki, and give a more direct implementation of this idea. We can implement a datatype that describes the “progress” of an operation:

`data Progress = Done | NotYet Progress`

The idea is that we can execute one step of an operation by pattern matching on an appropriate value of type `Progress`

:

```
step :: Progress -> Progress
step Done = Done
step (NotYet p) = p
```

For `(++)`

it is easy to construct a `Progress`

value which will execute the append; all we need to do is force (part of) the spine of the resulting list:

```
forceSpine :: Int -> [a] -> Progress
forceSpine 0 _ = Done
forceSpine _ [] = Done
forceSpine n (_:xs) = NotYet (forceSpine (n-1) xs)
```

For other operations this is more difficult. We need some way to express a computation split into multiple steps. We can use the following datatype for this purpose:

`data Delay a = Now a | Later (Delay a)`

`Delay a`

is a computation of an `a`

, but we mark the various steps of the computation using the `Later`

constructor (this datatype is variously known as the delay monad or the partiality monad, but we will not need the fact that it is a monad in this blog post). For example, here is reverse:

```
revDelay :: StrictList a -> Delay [a]
revDelay = go []
where
go :: [a] -> StrictList a -> Delay [a]
go acc SNil = Now acc
go acc (SCons x xs) = Later $ go (x:acc) xs
```

We then need to be able to execute one step of such a computation. For this purpose we can introduce

`runDelay :: Delay a -> (a, Progress)`

which returns the final value, as well as a `Progress`

value which allows us to execute the computation step by step. The definition of `runDelay`

is somewhat difficult (see appendix, below), but the idea hopefully is clear: evaluating the resulting `Progress`

*n* steps will execute precisely *n* steps of the computation; if you look at the resulting `a`

value before having stepped the entire `Progress`

the remainder of the computation will run at that point.

Finally, we can execute two operations in lockstep by pattern matching on two `Progress`

values at the same time:

```
par :: Progress -> Progress -> Progress
par !p Done = p
par Done !p' = p'
par (NotYet p) (NotYet p') = NotYet (par p p')
```

### Real-Time Queues

We can use the `Progress`

datatype to implement real-time queues: queues where both insertion and deletion has O(1) worst case complexity. The representation is much like we used in the previous section, but we add a `Progress`

field (`Progress`

is an example implementation of what Okasaki calls a “schedule”):

`data Queue3 a = Q3 !Int [a] !Int !(StrictList a) !Progress`

Re-establishing the invariant happens much as before, except that we record the resulting `Progress`

on the queue:

```
inv3 :: Queue3 a -> Queue3 a
inv3 q@(Q3 f xs r ys _)
| f < r = let (ys', p1) = runDelay $ revDelay ys
xs' = xs ++ ys'
p2 = forceSpine f xs'
in Q3 (f+r) xs' 0 SNil (par p1 p2)
| otherwise = q
```

All that is left to do now is make sure we take a step of the background reverse and append actions on each call to `tail`

and `snoc`

:

```
instance Queue Queue3 where
empty = Q3 0 [] 0 SNil Done
head (Q3 _ (x:_ ) _ _ _) = x
tail (Q3 f (_:xs) r ys p) = inv3 $ Q3 (f-1) xs r ys (step p)
snoc (Q3 f xs r ys p) y = inv3 $ Q3 f xs (r+1) (SCons y ys) (step p)
```

## Conclusions

It is difficult to develop data structures with amortised complexity bounds in strict but pure languages; laziness is essential for making sure that operations don’t unnecessarily get repeated. For applications where amortised bounds are insufficient, we can use an explicit schedule to make sure that operations get executed bit by bit; we can use this to develop a pure and persistent queue with O(1) insertion and deletion.

In his book, Okasaki does not introduce a `Progress`

datatype or any of its related functionality; instead he makes very clever use of standard datatypes to get the same behaviour somehow implicitly. Although this is very elegant, it also requires a lot of ingenuity and does not immediately suggest how to apply the same techniques to other datatypes. The `Progress`

datatype we use here is perhaps somewhat cruder, but it might make it easier to implement other real-time data structures.

Random access to (any of the variations on) the queue we implemented is still O(*n*); if you want a datastructure that provides O(1) insertion and deletion as well as O(log *n*) random access you could have a look at Data.Sequence; be aware however that this datatype provides amortised, not real-time bounds. Modifying `Sequence`

to provide worst-case complexity bounds is left an exercise for the reader ;-)

## Appendix: Implementation of `runDelay`

The definition of `runDelay`

is tricky. The most elegant way we have found is to use the lazy ST monad:

```
runDelay :: Delay a -> (a, Progress)
runDelay = \xs -> runST $ do
r <- newSTRef xs
x <- unsafeInterleaveST $ readSTRef r
p <- next r
return (runNow x, p)
where
next :: STRef s (Delay a) -> ST s Progress
next r = do
xs <- readSTRef r
case xs of
Now _ -> return Done
Later d -> do writeSTRef r d
p' <- next r
return $ NotYet p'
runNow :: Delay a -> a
runNow (Now a) = a
runNow (Later d) = runNow d
```

In the lazy ST monad effects are only executed when their results are demanded, but are always executed in the same order. We take advantage of this to make sure that the calls to `next`

only happen when pattern matching on the resulting `Progress`

value. However, it is crucial that for the value of `x`

we read the contents of the `STRef`

only when the value of `x`

is demanded, so that we can take advantage of any writes that `next`

will have done in the meantime.

This does leave us with a proof obligation that this code is safe; in particular, that the value of `x`

that we return does not depend on *when* we execute this `readSTRef`

; in other words, that invoking `next`

any number of times does not change this value. However, hopefully this is relatively easy to see. Indeed, it follows from parametricity: since `runDelay`

is polymorphic in `a`

, the only `a`

it can return is the one that gets passed in.

To see that pattern matching on the resulting `Progress`

has the intended effect, note that the ST ref starts with “cost *n*”, where *n* is the number of `Later`

constructors, and note further that each call to `next`

reduces *n* by one. Hence, by the time we reach `Done`

, the computation has indeed been executed (reached the `Now`

constructor).

Note that for the case of the queue implementation, by the time we demand the value of the reversed list, we are sure that we will have fully evaluated it, so the definition

`runNow (Later d) = runNow d`

could actually be replaced by

`runNow (Later _) = error "something went horribly wrong!"`

Indeed, this can be used to debug designing these real time data structures to ensure that things are indeed fully evaluated by the time you expect them to. In general however it makes the `runDelay`

combinator somewhat less general, and strictly speaking it also breaks referential transparency because now the value of `x`

*does* depend on how much of the `Progress`

value you evaluate.

For more information about the (lazy) ST monad, see *Lazy Functional State Threads*, the original paper introducing it. Section 7.2, “Interleaved and parallel operations” discusses `unsafeInterleaveST`

.