Objects with special collection routines in GHC's GC

Ömer Sinan Ağacan – Tuesday, 15 May 2018

A generational copying garbage collector, in its most basic form, is quite simple. However, as we’ll see, not all objects can be copied, and some objects require more bookkeeping by the RTS. In this post we’re going to look at these type of objects that require special treatment from the garbage collector (GC in short). For each type of object we’ll look at how they’re allocated and collected. Each of these objects solves a particular problem:

Large objects solve the problem with expensive copying of e.g. large arrays.
Pinned objects solve the problem of sharing data with foreign code.
Compact objects solve the problem of expensive scanning of large boxed heap objects.
Weak pointers allow finalizers and non-leaky memo tables.
Threads with special collection rules allow BlockedIndefinitelyOnMVar async exceptions.
Stable pointers solve the problem with sharing references to Haskell objects with foreign code.
Stable names allow pointer equality.
Static objects allow efficient collection of data allocated in compile-time by not allocating them in the heap.

This post is mostly intended for GHC hackers, but may also be useful for Haskell developers who develop performance-critical systems and need to know about impacts of these objects to GC and allocation pauses.

1. Large objects

A unit of allocation in GHC’s storage manager is a block, which is 2¹² bytes (4k) large. A large object is an object that is at least as large as 80% of a block (3277 bytes). Because these objects are expensive to copy, GHC’s RTS treats them as immovable, and does not use the usual copying GC routines to collect them.

Allocation: Allocation is done via the RTS function allocateMightFail. If the size parameter is smaller than 80% of a block allocateMightFail allocates in the nursery. Otherwise it allocates enough blocks to hold the requested amount, marks allocated blocks as BF_LARGE, adds the blocks to the first generation’s large_objects list. One example where this happens is when newArray# is called with large-enough size argument.

Because large object allocation requires allocating new blocks, this requires synchronization via sm_mutex (sm for “storage manager”).

Collection: When we see a large object during GC (i.e. a block with BF_LARGE flag set) we remove the object from its generation’s large_objects list¹, add it to the GC’s todo_large_objects list. todo_large_objects list is later scanned by scavenge_large, which adds large objects to their next generation’s scavenged_large_objects lists. Finally, after the GC is done, scavenged_large_objects lists are appended to large_objects lists of their generations.

Effectively we move large objects that we found during the usual copying GC to their next generations without copying.

2. Pinned objects

Pinned objects are similar to large objects; they’re also not moved by the GC. However, unlike large objects, they don’t have a minimum size. The RTS maintains capability-local pinned block lists, and allocates pinned objects in these blocks. Currently the only pinned object type is the MutableByteArray#, which is allocated via newPinnedByteArray#.

Allocation: Allocation is done via the RTS function allocatePinned. If we have enough space in the capability-local pinned object block we allocate there, otherwise we allocate a new block and start using it as the new pinned object block. The old block is added to the capability-local pinned_object_blocks list, which will be later moved to the first generation’s large_objects list for collection.

Pinned object allocation only requires synchronization (via sm_mutex) when the current capability’s pinned object block does not have enough space.

Collection: When allocating pinned objects we cheat a little bit, and mark all pinned blocks as BF_LARGE. This makes the GC collect pinned blocks the same way that it collects large blocks. collect_pinned_object_blocks moves all capabilities’ pinned_object_blocks to the first generation’s large_objects list.

One another difference between a pinned block and a large block is that a pinned block may have many objects, while a large block is just one object. However, this does not cause problems when collecting a pinned block as if it was a large block, because the primop for pinned object allocation returns a mutable byte array, which does not contain any pointers and does not provide an API for adding arbitrary heap objects to the array. This also means unlike large objects pinned objects are not scavenged.

3. Compact objects

Compact objects are similar to pinned and large objects in that they’re not moved. Unlike pinned and large objects, they can have multiple blocks linked together (whereas pinned and large objects are contiguous block groups). As far as GC is concerned a compact object is just one object with no outgoing pointers, so the fact that they can have multiple linked block groups is not important: we only worry about liveness of the object as a whole, and there is no need to scavenge it.

See the paper for some example uses of compact objects.

Allocation: Allocation is done via compactNew RTS function. compactNew adds the compact block to the first generation’s compact list. compactAdd copies an object (with everything referenced by it) to a compact by allocating in the compact’s existing blocks, allocating new blocks if necessary (code).

Compact object blocks are marked as BF_COMPACT. Compact object allocation requires synchronization via sm_lock.

Collection: When we see an object that lives in a block with BF_COMPACT we get the head of the block (remember that blocks for a compact object are linked together) and evacuate the head block by removing it from its current compact list adding it to the next generation’s live compact objects list (code). After GC is done any compact objects left in compact lists are collected and live compact objects are moved to compact lists.

4. Weak pointers

A weak pointer (or weak object) is essentially a record with key, value, and finalizer fields. While all of these fields are ordinary Haskell objects, they’re not marked as pointer fields in the weak object info table² ³, so a reachable weak object does not make these fields reachable.

As to why this is useful, see the paper and System.Mem.Weak module documentation.

Allocation: Weak pointers are allocated via mkWeak and mkWeakPtr. These high-level functions use the mkWeak# primop. mkWeak# allocates a heap object for the actual Weak value, and adds it to the current capability’s weak pointer list.

Collection: As already noted, because fields of weak objects are not marked as pointers, we do not copy its fields. Instead we do this:

Right before starting evacuating the roots, move Capability-local weak pointers to the first generation’s weak pointer list. (code)
After evacuating the roots but before scavenging, move weak pointer lists of collected generations to old_weak_ptr_lists. (code)
During scavenging, traverse old_weak_ptr_lists of collected generations, and for each weak pointer check if the key is alive (code) (keys in uncollected generations are considered alive). If it is then evacuate other fields of the weak object, and promote it to next generation’s weak_ptr_list. Otherwise just skip it (do not detach it from the list!).

If in this step we evacuated anything (i.e. found an alive key) we signal the scavenging code to continue scavenging, because we have to copy any heap objects reachable from the weak object’s fields.

If we traverse all of the weak pointer lists without evacuating anything then we’re done. We append all old_weak_ptr_lists to the global dead_weak_ptr_list while also evacuating their finalizer fields. dead_weak_ptr_list is later used to schedule threads to run finalizers.

This effectively implements our special GC strategy for weak objects: a weak object does not make its key or value reachable, but if its key is reachable (even if the weak itself is not reachable) then the value is also reachable. When the key becomes unreachable we schedule finalizers.

5. Threads

A GHC thread is a heap object like any other. However, when a thread becomes unreachable we sometimes want to raise an async exception in the thread instead of collecting it, effectively resurrecting the thread.

What does it mean for a thread to become unreachable? The reachability rules for threads are:

If ThreadId for a thread is reachable, then the thread is also reachable ⁴.
If a thread has work to do and not blocked then it’s reachable. This is because threads with work to do are in capabilities’ run queues, which are roots for GC.
If a thread is in a blocking queue (e.g. when blocked by an MVar operation), and the object that owns the blocking queue is reachable, then the thread is also reachable.

Now suppose that a thread was blocked in an MVar operation, and its ThreadId is unreachable. In this case we want to raise a BlockedIndefinitelyOnMVar exception in the thread. This suggests that we have to somehow know about threads that become unreachable. So we maintain a list of threads. Before moving on to the details, here’s a simple program that demonstrates the reachability rules:

{-# LANGUAGE ScopedTypeVariables #-}

module Main where

import Control.Concurrent
import Control.Exception
import System.Mem

main :: IO ()
main = do
    var :: MVar Int <- newEmptyMVar

    thr <-
      forkIO $
        catch (takeMVar var >>= print)
              (\(e :: BlockedIndefinitelyOnMVar) ->
                   putStrLn "Blocked")

    performMajorGC
    yield
    performMajorGC
    threadDelay 1000000

This version should print a single Blocked line. This is because both the MVar and the thread become unreachable during the first GC and the thread is resurrected⁵.

Now, add seq var (return ()) or seq thr (return ()) at the end. In both cases we don’t get an exception in the thread because it’s still reachable, and thus not resurrected.

Now, the details.

Allocation: A thread is allocated with forkIO, which calls fork# primop, which in turn calls createThread RTS function. createThread allocates the thread in the nursery, and adds the thread to the first generation’s threads list ⁶. Thread allocation requires synchronization via sched_mutex.

At this point we have a thread but it’s not scheduled yet. For this the primop calls scheduleThread RTS function with the return value of createThread. scheduleThread adds the thread to the current capability’s run queue. At this point the thread becomes a GC root.

Collection: Collection works similarly with weak object collection. Before starting the GC, for collected generations, we move the threads lists to old_threads lists. Any threads that will be in the old_threads lists will need resurrection if they’re blocked (otherwise they will be collected).

During scavenging, we traverse old_threads lists, moving any alive threads to their next generation’s threads lists (code). Note that this process interleaved with step (3) of weak object collection, because a weak object with alive key may make a thread reachable (by referencing to it in its value).

When we don’t find any more weaks with live keys we “resurrect” threads in the old_threads lists if they were blocked in an operation (rather than being killed or finished). This happens in resurrectUnreachableThreads. Because resurrecting makes some unreachable objects reachable again (by adding threads to run queues) we continue with scavenging after this process.

Aside: The way blocked threads are resurrected causes somewhat counter-intuitive behavior when multiple threads are blocked but resurrecting one thread unblocks others. Apparently this behavior is discovered by users and reported as a bug from time to time (see e.g. #10241). The code from #10241 demonstrates the issue (slightly simplified):

module Main where

import Control.Concurrent
import Control.Concurrent.MVar
import Control.Exception

main :: IO ()
main = do
    mvar1 <- newEmptyMVar :: IO (MVar ())
    mvar2 <- newEmptyMVar :: IO (MVar ())

    forkIO $ do
        takeMVar mvar1 `catch` errorHandler1
        putStrLn "after error catch"
        threadDelay 1000000
        putMVar mvar2 ()
        putStrLn "after putMVar"

    readMVar mvar2 `catch` errorHandler2
    putStrLn "after readMVar"
    threadDelay 5000000
  where
    errorHandler1 :: BlockedIndefinitelyOnMVar -> IO ()
    errorHandler1 e = putStrLn ("errorHandler1: " ++ show e)

    errorHandler2 :: BlockedIndefinitelyOnMVar -> IO ()
    errorHandler2 e = putStrLn ("errorHandler2: " ++ show e)

The idea is that while both threads are blocked, resurrecting one of them unblocks the other, so we should see only one “errorHandler…” line in the output. However, because all dead threads are resurrected once, both threads end up getting a BlockedIndefinitelyOnMVar exception.

There’s a simple fix that I explained in the ticket and implemented, but it’s rejected because it’s non-deterministic in which thread to resurrect.

Personally I think this non-determinism is OK because current behavior is worse, and scheduling (and thus GC and BlockedIndefinitelyOnMVar exceptions) are already non-deterministic, so I doubt anyone relies on determinism in such programs.

6. Stable pointers

A stable pointer is simply a reference to a Haskell object. However, they’re never collected, and after every GC stable pointers are updated to point to new locations of the Haskell objects that they were pointing to before the GC. This is useful when storing references to Haskell objects in foreign code.

Allocation: Done via getStablePtr RTS function. Stable pointers are held in an array of spEntry (which is just typedef for a pointer). New pointers are simply added at the end of the array, and array is resized (capacity is doubled) when it’s out of room.

Collection: The stable pointer table is never collected. However, its contents are evacuated as GC roots, and after every GC addresses of objects in the table is updated.

GHC itself uses stable pointers in an interesting way: some closures are used by the RTS, but not by generated code. Because these closures are not referenced by the generated code the GC is free to collect them. To prevent this the RTS adds them to the stable pointer table, effectively making them roots. This way they’re never collected and the RTS can use them.

7. Stable names

A stable name just some abstract object unique to a Haskell object. They come with a fast Eq instance and a hash function, and when you generate stable name for an object multiple times you get the same stable name back. This means stable names can be used for implementing something similar to pointer equality.

Allocation: Done via makeStableName# primop. Stable names are held in an array similar to stable pointers, however to enable collection the entry type is more complex. I updated comments of this type while writing this post so give it a read for details.

Collection: Because we keep a list of all stable names, and stable names know addresses of their StableName objects (sn_obj field), after a GC we can check if a stable name is still alive. If not then we free its entry in the table. Otherwise we evacaute the pointed object. This happens in gcStableTables.

Note that stable names, unlike stable pointers, are not roots. gcStableTables is called after GC is done.

8. Static objects

Top-level definitions are allocated as static objects. The allocation happens in compile time, outside of the heap.

Allocation: Allocation happens in compile time. It’s not possible to allocate a static object in runtime.

Collection: The space allocated for static objects is never really reclaimed. When the GC encounters a static object⁷ it adds the static object to the GC-thread-local static objects list. This list later used by scavenge to evacuate any objects referenced by these static objects.

To avoid collecting a static object multiple times, evacuated static objects are tagged with a flag. Before a major GC the flag is swapped so that evacuate_static can distinguish static objects flagged in the current GC from those flagged in the previous GC.

Static objects are only collected in major GC.

Conclusion

Most of these objects require a combination of global data structures (like stable name and pointer tables), capability-local lists (large objects, pinned objects, weak pointers, threads) and generation-local lists (large objects, pinned objects, weak pointers, compact objects, threads). Among these objects, most of the complexity is caused by weak pointers and threads, because of the special reachability rules. When these are added the runtime becomes quite complex but also useful.

This is a doubly-linked list so this operation is efficient.↩︎
The info table is defined here. The 1 for one pointer field for C finalizer list, 4 is for four non-pointer fields for the key, value, Haskell finalizer, and a field to make an intrusive linked lists from weak objects.↩︎
The C finalizers field is used by addCFinalizerToWeak# primop. I don’t know what high-level functions use this. C finalizers are like Haskell finalizers, but they’re more reliable in that they’re always called eventually when key of a weak objects becomes unreachable (in the worst case when the Haskell process terminates).↩︎
This is actually a “misfeature”. Ideally a ThreadId would only keep the thread alive when it still has work to do.↩︎
Why do we need two performMajorGCs and a yield in between? I don’t understand this yet, but I can see in gdb that the thread is definitely resurrected during the first GC and added to the run queue. The reason why we need the other stuff is probably related with the scheduler.↩︎
The name choice here is terrible because it’s making searching for uses impossible. For this reason I have a commit that renames this to thread_list which I cherry-pick to my work branches.↩︎
This is done via HEAP_ALLOCED macro, which checks if the object is outside of heap.↩︎