Updated at Feb 2, 2017

This is adapted from transcript for a talk.

In this post, I’d like to introduce some introductory knowledge about GC, i.e. garbage collection.

First, let’s take an overview:

Memory is a crucial resource to get a program run. While OS grants your program many raw space through some basic allocation interfaces (virtual memory mechanism + memory allocation sys-calls), how well you can utilize the memory space is your own problem. Here, by “how well”, I mean:

  1. Correctness: Is every read/write a well-defined operation? Will you program exit because of segment fault etc.?
  2. Efficiency: How well can you reduce the total amount of memory you need (by clever reuse etc.)?

Not just heap has such problems. Although stack is managed by hardware and OS, there are still a lot of correctness-related problems as well [1]. On the other hand, heap is managed by user (mostly through language-specific library) and is more amenable to problems in the second basket, i.e., efficiency.

Let’s focus on heap. Depending on OS, the most primitive way to allocate heap memory can vary. For example, in Unix/Linux, we have brk and sbrk.

#include <unistd.h>

int brk(void *end_data_segment);

void *sbrk(intptr_t increment);

brk sets the break line above the heap, and sbrk can be used to increase/decrease it or get its current value with sbrk(0) (a classical HACK I believe).

The malloc, free et cetera are implemented on top of these basic sys-calls. This layer is called allocator. We have a lot of implementations, such as jemalloc, glic, tcmalloc … but the their interfaces are similar.

Some system chooses to add its own allocator layer and maintains a pool of blocks to improve efficiency, like slab in Linux kernel and object-specific allocators in Python. But in this layer, the pointer is still unmanaged. Users have to explicitly allocate memory with malloc and deallocate with free, or let compiler decide the size (e.g. new and delete in C++).

This leads to a lot of classical bugs, such as:

  1. Memory leak: Some memory is allocated but we ended up not freeing it before the termination of program.
  2. Dangling pointer: Some memory pointed by p is deallocated, but p is dereferenced later.

If you just don’t delete, you won’t get dangling pointer, but might cause memory leak; In contrast, if you delete too eagerly, delete something still alive in other places, you will have dangling pointers.

There are many approaches to guarantee both soundness and efficiency of memory management automatically, and GC is one of them (Rust took another path: type system).

Garbage collection solves these problems by enforcing several properties (or invariants) and manage the pointers for users:

  1. Any reference that can be accessed directly (by stack or global data, we call them roots) or indirectly from roots, points to allocated memory. We call such references valid.
  2. Any memory not accessible either directly or indirectly, should be reclaimed as soon as possible.

Garbage collection concepts

Basically, we can view the world of managed pointers as a directional graph: A -> B means A holds a reference to B. Here, what does “reference” mean? Take it concretely, “reference” should be a typed pointer to another object.

The reference might be the raw pointer, for example, a 32-bit long address; or a structure containing the pointer (The structure might have other fields indicating the attributes of the object being pointed to etc.).

Here I stressed that a reference should also be typed. What does that mean? It means that, we can know what the pointer points to at compile time. Only then, can we write code like p->next and hope it will be compiled as an efficient load.

Formally, denote symbol table as $\Gamma$, we wish to have $\forall p \in Roots, \Gamma \vdash p : \tau$.

It might reminds you of Python or JavaScript, which doesn’t really has a static type system [5]. It is object-based and “type” of object is determined dynamically. And when we are referencing one object from another, we must do runtime indirect jumping. Let’s compare p->q->d in C and p.q.d in python. Suppose p, q, d are all typed structs so we can represent the address of d by *(*(p + offset) + offset) + offset. But by p.q.d, we must interpret like this: (*p.attr_table.get_value_by_key("q")).attr_table.get_value_by_key("q"). So I think I don’t have to remind you the overhead incurred by using a runtime hash table. And it is also really unsafe (note 1: duck typing) (note 2: we can use aggressive optimisations to reduce the overhead though).

But you might say, collector runs at runtime. It doesn’t have access to compile-time information, how could it know the structure type to do traversing fast and correctly?

Answer: Emit traversal code per type at compile time.

For example, let’s say there is a x: Gc<T> on stack. In C, it should be implemented as something like:

struct Gc_T {
    void *traverse_T(T*, void *(void*));
    // other auxiliary fields
    void *T;
};

Its first field is a pointer to a function traverse_T, which needs a pointer to T to traverse T structure, and a function which doing real GC job per node.

Actually, we have another way of doing the same thing, by not keeping a function but a table and traverse T based on offsets information documented in the table. But that could be a problem for growable lists and some recursive data types. Concretely, for example, you write a growable vector, then how possible would it be to maintain the table for all existing vectors? But if what we maintain is a function, things are much simpler:

void traverse_vector(void* v, void *f(void *)){
    Vector *vec = (Vector *)v;
    for(int i = 0; i < vec.size(); i++) {
        (*f)(vec.at(i).raw); // marking et cetera
        (*(vec.at(i).traverse))(vec.at(i).raw, f);
    }
}

You might worry the call-stack might become too deep. But we can pass the discovered pointers in a work queue instead of f, and use a loop to consume the work queue. So the stack won’t grow, although some extra space overhead is required.

Now, we think we are finally done with the trace problem. This is why data structure is important in GC implementation. The algorithms are almost always intuitive, but devils are in details.

Why data structure is important?

Like all other systems, there are very limited literature on how it works, and how should we implement them correctly, while being readable enough.

So it can be a pain trying to dig into the world of GC by yourself. For books such as The Garbage Collection Handbook, there are a lot of concepts, ideas, and pseudo-code in it, but nearly no complete implementation, or even a toy provided.

Suppose that you just want to play with the Mark-Sweep algorithm. Then you must at least know, how to reach other cells from a given cell. In pseudo-code algorithm, it is simply

for reachable in reach_from(cell)

but in implementation, it can be far-more complicated. First, it is natural to ask for a generic support for all possible structures, for example:

struct LinkedList<T> {
    data: T,
    next: Option<Gc<T>>,
}
	
// Or
	
struct Pair<U, V> {
    fst: Gc<U>,
    snd: Gc<V>,
}

Here, the Gc is a wrapper representing a collector-managed pointer to some type of data. Apparently, for different kinds of user structure, the reachability is different. So we have to write boilerplate dispatch code + recursively reaching routine for each kind of user structure.

If we don’t do boilerplate dispatch (or generic programming), we must use dynamic dispatch, i.e. hardcode type information into the real Gc cell’s header. It is an overhead in space and we still have to recover type information recursively. Or else we can put all pointing-out pointers into the headers, so we can write less generic but more general code. But it will bloat the object and somehow slow down the tracing speed.

In any case, you have to design the data structures by your need.

Algorithms

The commonly seen GC algorithms can be categorized into three genres:

  • Mark-sweep
  • Copying
  • Reference-counting

You might be surprised that RC is also one of them. But it is not the naïve RC ;)

The following sample code is adapted from The Garbage Collection Handbook [6]. They are used because they looks intuitive.

Mark-Sweep

New():
    ref <- allocate()
    if ref = null
        collect()
        ref <- allocate()
        if ref = null
            error "Out of memory"
    return ref

atomic collect():
    markFromRoots()
    sweep(HeapStart, HeapEnd)
        

markFromRoots():
    worklist <- empty
    for each fld in Roots
        ref <- *fld
        if ref != null && not isMarked(ref)
            setMarked(ref)
            add(worklist, ref)
            mark()

mark():
    while not isEmpty(worklist)
        ref <- remove(worklist)
        for each fld in Pointers(ref)
            child <- *fld
            if child != null && not isMarked(child)
                setMarked(child)
                add(worklist, child)

sweep(start, end):
    scan <- start
    while scan < end
        if isMarked(scan)
            unsetMarked(scan)
        else
            free(scan)
        scan <- nextObject(scan)

A possible set of underlying data structures for this algorithm:

  • Objects are allocated consecutively in heap
  • Every object cell have a header including:
    • isMarked: bool
    • isRoot: bool
    • nextObject: void*
    • traceFunction: void *

So for mutator, it calls Gc::new(T()) to construct a Gc<T>. During the construction, if there is enough space, then we will simply allocate a properly-sized cell at the tail; isRoot will be initialized to true as well.

If the space is not enough, we will invoke collect().

See rust-gc in Rust as a reference.

Copying

createSemispaces():
    tospace <- HeapStart
    extent  <- (HeapEnd - HeapStart) / 2
    top <- fromspace <- HeapStart + extent
    free <- tospace

atomic allocate(size):
    result <- free
    newfree <- result + size
    if newfree > top
        return null
    free <- newfree
    return result

atomic collect():
    flip()
    scan <- free
    for each fld in Roots
        process(fld)
    while not scan == free
        ref <- scan
        scan <- scan + size(scan)
        scan_flds(ref)

flip():
    fromspace, tospace <- tospace, fromspace
    top  <- tospace + extent
    free <- tospace

scan_flds(ref):
    for each fld in Pointers(ref)
        process(fld)

process(fld):
    fromRef <- *fld
    if fromRef != null
        *fld <- forward(fromRef)

forward(fromRef):
    toRef <- forwardingAddress(fromRef)
    if toRef = null
        toRef <- copy(fromRef)
    return toRef

copy(fromRef):
    toRef <- free
    free <- free + size(fromRef)
    move(fromRef, toRef)
    forwardingAddress(fromRef) <- toRef
    return toRef

First, let’s introduce tri-colour abstraction. In a lot of algorithms, we like to use colors to represent state of data. Here in the object graph, we partition objects by different colors:

  1. white: initial state
  2. grey: encountered, during processing
  3. black: children are identified, finished processing

You can image two pointers: free and scan. And after flipping, free points to next place where object from fromspace can be copied to, and scan represents the scanning progress (i.e. from gray to black in tri-colour abstraction jargo). The objects from scan to free are those copied but yet to be processed (gray ones).

Another key function is forwardingAddress. The fld in process(fld) is a pointer field, so by dereferencing, we trace to the pointed cell.

Well, this is a bit hackish. I personally would perfer that the every cell in space reserves a header indicating if it stores a forwarded address or the real object. So we don’t have to face the ambiguity here (pointer detection is also a topic in GC, you maybe see 11.2 of The Garbage Collection Handbook [6]).

Reference Counting

New():
    ref <- allocate()
    if ref = null
        error "Out of memory"
    rc(ref) <- 0
    return ref

atomic Write(src, i, ref):
    addReference(ref)
    deleteReference(src[i])
    src[i] <- ref

addReference(ref):
    if ref != null
        rc(ref) <- rc(ref) + 1

deleteReference(ref):
    if ref != null
        rc(ref) <- rc(ref) - 1
        if rc(ref) = 0
            for each fld in Pointers(ref)
                deleteReference(*fld)
            free(ref)

The algorithm is easy to understand. The problem of ref-count, as we all know, is that it’s unable to detect unreachable cyclic subgraph. However, that doesn’t mean that RC is a dead-end.

One idea is to assist the RC with an infrequent tracing collection. Another idea is to probe the subgraph: start from some node we do tracing deletion, then see if every node in the subgraph is unreachable (i.e. its reference count drops to zero), if yes, then there is no outside referencing so the whole sub-graph can be deleted.

Performance Problem

For different kinds of algorithms, some specialized algorithms and structures are proposed to improve the performance. For example, to improve cache performance, we might use a side table (such as a bitmap) to store mark bits. First, the bitmap in one place won’t contend for the cache lines. Second, we don’t have to modify mark bit inside object itself, but only need to read object – so less cache lines will get dirty.

And for copying collection, locality can be improved by studying different tracing orders – depth first or breadth first. By putting related objects as near as possible, we can have better locality.

For the RC, the main problem is too frequent R/W over reference count. We can use a deferred RC, in which only operation on heap-objects are performed instantly, and stack objects’s RC adjustments are batched.

Concurrent & Parallel GC

Here, I will give a simple introduction of the parallel and concurrent GC.

First, there is difference between parallel GC and concurrent GC.

  • By being parallel, we mean that mutator is stopped, and several collector threads can work simultaneously, which can speed up the collection process, but don’t solve the pausing problem entirely.
  • By being concurrent, we mean that the collection doesn’t need mutator to be stopped to continue collection. Here is a graph illustrating the subtle difference between them.

Like every concurrent system, we have to deal with:

  1. Sharing: How many user threads share how many collector threads? How many collector threads per physical processor?
  2. Synchronization: How to synchronize the mutation with collection? Is it possible for different threads having a consistent view of referencing relationships?

“The lost object problem” is one example of how concurrent GC would go wrong. Look at the picture.

Here, the mutator interferences the wavefront property of tricolour abstraction, thus causing object to be lost.

So, there are several techniques can be used to prevent problems:

  • Grey mutator barrier: For example, Steele’s barrier will detect if we connect white node to black one, if so, the black one will be reverted back to gray, and these operations are atomic.
  • Black mutator barrier: For example, Backer’s barrier will promise the returned ref will be shaded grey if it is the wavefront.

It is mutator’s part to prevent live objects from being hidden from the collector.

We won’t talk much about concurrent GC here. You might want to look at Baker’s algorithm for a concurrent copying collector. These are years of research’s result.

References

  1. One, Aleph. “Smashing the stack for fun and profit.” Phrack magazine 7.49 (1996): 14-16.
  2. JavaScript data types and data structures, https://developer.mozilla.org/en/docs/Web/JavaScript/Data_structures
  3. Jones, Richard, Antony Hosking, and Eliot Moss. The garbage collection handbook: the art of automatic memory management. Chapman & Hall/CRC, 2011.