Updated at Feb 2, 2017
That said, this post adds a lot of personal comments and narrative while omitting many aspects of the original slides for various reasons. Although I will try to make this post self-contained, it is generally recommended to read the slides and papers attached in the above link alongside it if you are interested.
We depend on software every day to do business, drive infrastructure, support communications, and so on, but like everything else, software can have drastically different quality.
We have several metrics for measuring software quality:
- Reliability: How often does the system fail? Here, by fail we mean the moment the system goes down or crashes.
- e.g. MTTF (Mean Time To Failure)
- Availability: How often is the system available? Here, by available we mean the state in which the system is working.
- e.g. Available Time / Total Time
- Dependability: Availability + Reliability + Security; a more comprehensive metric.
Note that reliability is not equivalent to availability. For example, by monitoring the system status and doing a quick restart, or routing requests to a hot backup when the system fails, we can improve overall availability despite poor reliability.
Also, there are many reasons that lead to computer system failures:
- Hardware problems: for example, CPU overheating, a corrupted hard drive, or a bit in RAM flipped by a cosmic ray.
- Software problems (our focus in this post): such as segmentation faults, bus errors, integer overflows etc.
- Operator/configuration problems: for example, you pulled the power cord accidentally, or ran `sudo rm -rf folder/file` etc.
Another interesting question is how to compute the system availability based on components’ availability. The answer should depend on your system configuration:
As the picture shows, if the components do not depend on each other but work in parallel (redundantly), then $A = 1 - \prod_i (1 - A_i)$.
But if they depend on each other (in series), then $A = \prod_i A_i$.
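As a quick sanity check on these two formulas, here is a minimal Python sketch (the component availabilities are made-up numbers for illustration):

```python
def serial_availability(components):
    """Dependent components: all must be up, so A = prod(A_i)."""
    total = 1.0
    for a in components:
        total *= a
    return total

def parallel_availability(components):
    """Redundant components: the system is down only when all are down,
    so A = 1 - prod(1 - A_i)."""
    all_down = 1.0
    for a in components:
        all_down *= 1.0 - a
    return 1.0 - all_down

# Two servers, each 90% available:
# in series the system is ~81% available, in parallel ~99%.
```

Redundancy (the parallel case) is exactly how the hot-backup trick above improves availability even though each component's reliability is unchanged.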
Also, to better study the problem, we can split it into stages (the naming here is ambiguous – you just need to grasp the meaning behind each):
- Fault: bug in source code
- Error: the direct effect of the bug, such as a segmentation fault
- Failure: the behavior of the system that the user perceives, such as a blue screen
Now we turn to a more specialized topic – software bugs. I don't think I need to stress their importance, so how can we fight them? There are several dimensions:
- Constraints: Under what conditions will the bug be triggered?
- Types: including
- Memory bugs
- Semantics bugs
- Concurrency bugs
- Performance/energy bugs
- In-house (during development, requires high-accuracy):
- Bug avoidance
- Detection & Testing
- In-field (during production, requires low-overhead):
- Bug prevention
- Failure recovery
- Failure diagnosis
- Bug fixing
One important genre of software bugs relates to memory issues, such as:
- Memory leak
- Uninitialized reads
- Dangling pointers
- Double free
- Null pointer dereference (some controversy here)
Memory leak detection
We can do dynamic instrumentation:
Or static checking:
They both have pros and cons. In my opinion, dynamic approaches are more precise, but their results are hard to interpret, and they suffer from poor coverage due to dependency on inputs and environment.
Static approaches struggle with the conflict between precision and coverage, but they enjoy more generality since they don't depend on specific inputs.
Buffer overflow detection
`char *p = malloc(10); p[10] = 'A';  // out-of-bounds write`
To detect the overflow shown above, we can insert instructions in `malloc` to record each allocation, including the site and size; then for each memory access, we insert a barrier to check whether the offset falls outside the allocated range.
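A toy model of this bookkeeping (pure Python, with a fake address space; the names are mine, not from any real tool):

```python
class TrackedHeap:
    """Toy model of an instrumented malloc: record each allocation's
    base and size, and have the access barrier check the offset."""

    def __init__(self):
        self.allocations = {}      # base "address" -> size
        self.next_addr = 0x1000    # fake address space

    def malloc(self, size):
        base = self.next_addr
        self.allocations[base] = size
        self.next_addr += size + 16   # leave a redzone-like gap
        return base

    def check_access(self, base, offset):
        """The read/write barrier: is base+offset inside the block?"""
        size = self.allocations.get(base)
        return size is not None and 0 <= offset < size

heap = TrackedHeap()
p = heap.malloc(10)
# heap.check_access(p, 9)  -> True  (last valid byte)
# heap.check_access(p, 10) -> False (the overflow from the example above)
```

The gap between blocks plays the same role as Valgrind's redzones, described next.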
Here are some excerpts from the Valgrind documentation:
At one end of the scale, Memcheck adds code to check every memory access and every value computed, making it run 10-50 times slower than natively. At the other end of the spectrum, the minimal tool, called Nulgrind, adds no instrumentation at all and causes in total “only” about a 4 times slowdown.
Valgrind’s malloc, realloc, etc, add padding blocks before and after each heap block allocated by the program being run. Such padding blocks are called redzones. The default value for the redzone size depends on the tool. For example, Memcheck adds and protects a minimum of 16 bytes before and after each block allocated by the client. This allows it to detect block underruns or overruns of up to 16 bytes.
Increasing the redzone size makes it possible to detect overruns of larger distances, but increases the amount of memory used by Valgrind. Decreasing the redzone size will reduce the memory needed by Valgrind but also reduces the chances of detecting over/underruns, so is not recommended.
Better detection approach
First, a hybrid approach always works better. Second, you have to do some hacks to lower the overhead incurred by runtime checks.
The hack might involve Hardware/OS support, such as
Also, even if bad things happen, how we can survive them (to avoid system failure or security compromise) is another interesting topic:
I won't cover many details of concurrency basics here; I'll just gather some interesting points from the lecture:
Sources of non-determinism:
- Single-core: System event
- Multi-core: System event + Parallel hardware
Two types of thread synchronization:
- Lock: enforce mutual exclusion, or atomicity
- Condition variable: enforce pair-wise ordering
For example, you should use a lock in case (1), and a condition variable in case (2):
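Since the pictures for the two cases are not reproduced here, a small Python sketch of both uses: a lock making a read-modify-write atomic, and a condition variable enforcing a produce-before-consume ordering:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    """Case (1): a lock makes the read-modify-write on `counter` atomic."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

ready = False
result = []
cond = threading.Condition()

def producer():
    """Case (2): a condition variable enforces produce-before-consume."""
    global ready
    with cond:
        result.append("produced")
        ready = True
        cond.notify_all()

def consumer():
    with cond:
        while not ready:        # re-check the condition after each wakeup
            cond.wait()
        result.append("consumed")

threads = [threading.Thread(target=increment, args=(10000,)),
           threading.Thread(target=increment, args=(10000,)),
           threading.Thread(target=consumer),
           threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 20000; result == ["produced", "consumed"] regardless of schedule
```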
The main root of concurrency bugs is the interleaving of execution between threads.
Note: the weak memory model is another (harder) problem, but it is out of this post's scope.
Detecting concurrency bugs is notoriously hard, because of:
- Large state space: interleaving
- False positives: racy algorithm
- False negatives: complex coordinating semantics
- Overhead: scales with the number of threads
The algorithm depends on ideas borrowed from general distributed systems (L. Lamport et al.), in which there are two ways of tracking causality/happens-before relationships under the message-passing model:
- Scalar timestamp (coarse)
- Vector timestamp (finer)
Also, here is the basic concept of logical time:
- Sequential Consistency (SC): operations within one thread happen in a sequential (program) order
- Send/Recv: for a given message, the receive event happens after the send event.
- Transitivity: $A \rightarrow B \wedge B \rightarrow C \Rightarrow A \rightarrow C$
And the scalar clock serves as a single global clock, e.g.:
while the vector clock is finer-grained, giving each thread its own slot, e.g.:
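A minimal vector-clock sketch in Python (two threads, slots indexed by thread id), showing how the rules above let us order events or declare them concurrent:

```python
def vc_event(clock, tid):
    """A local event in thread `tid`: tick the thread's own slot (SC rule)."""
    c = list(clock)
    c[tid] += 1
    return c

def vc_recv(local, sender_clock, tid):
    """Receive: take the element-wise max, then tick own slot (Send/Recv rule)."""
    c = [max(a, b) for a, b in zip(local, sender_clock)]
    c[tid] += 1
    return c

def happens_before(a, b):
    """a -> b iff a <= b element-wise and a != b (transitivity is free)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

send = vc_event([0, 0], 0)         # thread 0 sends: [1, 0]
recv = vc_recv([0, 0], send, 1)    # thread 1 receives: [1, 1]
other = vc_event([0, 0], 1)        # an unrelated event in thread 1: [0, 1]
# happens_before(send, recv) holds; send and other are concurrent
```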
Now we describe the idea of happens-before detection algorithm based on an example program:
For example, in this picture, we interpret `lock(L)` as first receiving a message from `L`: the message indicates either success (we locked an unlocked `L`) or failure (`L` is already locked, so we block). Only in the former case do we do the time-stamping, as if receiving a message from the last holder of `L`.
However, if the system's actual trace is `ptr = malloc(10); ptr = NULL; *ptr = 'a'`, there will be a bug. We can detect that violation by enumerating interleavings.
Problem: too many false negatives.
In this algorithm, the locked region is treated as an atomic block, so only two possibilities are permitted in the following code:
Problem: too many false positives
This pattern-based approach is based on a study of 105 real-world concurrency bugs.
First, we categorize bugs into two kinds:
- AV: Atomicity violation (70%)
- OV: Order violation (30%)
which can themselves be further categorized into:
- Shared Single-variable (70%)
- Shared Multi-variable (30%)
First, atomicity violation = unserializable interleaving. For what serializable means, a post by Peter Bailis looks pretty good to me.
Let me try to give an intuitive interpretation of the above equation. First, what do we hope for from a transactional operation like `a++`? We hope that `a++ || a++` is equivalent to `a += 2` (`||` means concurrent execution). However, this transaction is composed of several (here, two) operations: `tmp <- a; a <- tmp + 1` (think of this as readable x86 assembly). So we can easily imagine the problematic case:
```
Thread 1             Thread 2
tmp1 <- a
                     tmp2 <- a
                     a  <- tmp2 + 1
a  <- tmp1 + 1
```
This interleaving yields a serialized result whose semantics differ from the two transactions executing in a non-overlapping way. We call this an atomicity violation, or an unserializable interleaving.
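The lost update is easy to reproduce deterministically by driving the four micro-ops with an explicit schedule (a toy simulator, not real threads):

```python
def run(schedule):
    """Run two `a++` transactions, each split into a read micro-op and a
    write micro-op, in the thread order given by `schedule`."""
    a = 0
    tmp = {1: 0, 2: 0}
    pc = {1: 0, 2: 0}          # 0 = about to read, 1 = about to write
    for tid in schedule:
        if pc[tid] == 0:
            tmp[tid] = a       # tmp <- a
        else:
            a = tmp[tid] + 1   # a <- tmp + 1
        pc[tid] += 1
    return a

# run([1, 1, 2, 2]) -> 2   (serial: equivalent to a += 2)
# run([1, 2, 2, 1]) -> 1   (unserializable interleaving: one update is lost)
```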
So, if you interleave `OP x; OP x` from one thread with a single `OP x` from another thread, where each `OP` can be a read or a write, and since there is only one way to interleave (the remote access falls between the two local ones), you have 8 cases in total, of which four violate atomicity:
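The eight cases can be enumerated mechanically: tag each write with a unique token, run the one possible interleaving, and ask whether any serial order reproduces the same reads and final value (a sketch under the single-shared-variable model above):

```python
from itertools import product

def execute(schedule):
    """Run a schedule of (thread, idx, op) accesses to one shared variable.
    Writes install a unique token; reads record which token they saw."""
    val = "init"
    observed = {}
    for thread, idx, op in schedule:
        if op == "R":
            observed[(thread, idx)] = val
        else:
            val = (thread, idx)
    return observed, val

def serializable(op1, op2, remote_op):
    """The local thread runs op1 then op2; the remote op lands in between.
    Serializable iff the result matches the remote op running entirely
    before or entirely after the local pair."""
    first, second = ("local", 0, op1), ("local", 1, op2)
    remote = ("remote", 0, remote_op)
    interleaved = execute([first, remote, second])
    return interleaved in (execute([remote, first, second]),
                           execute([first, second, remote]))

cases = {c: serializable(*c) for c in product("RW", repeat=3)}
# exactly four of the eight (op1, op2, remote) cases are unserializable
```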
You may think the fourth case is ridiculous – how can that be wrong in any way? Well, see this:
```
Thread 1             Thread 2
a <- x
                     a = 0
x <- a + 1
```
Inference-based bug detection
The question is: Which code regions are expected to be atomic? There are several ways to do it:
- Manual annotation: most popular, e.g. the `synchronized` block in Java
- Testing validation
A motivating example: the thread state might be accessed before the thread is created.
We can determine what the correct order should be via learning-based [Micro'09, OOPSLA'10] or semantics-guided [ASPLOS'11] techniques; detection is then trivial.
The following example shows that `InProgress` is a shared indicator of whether some global work is being done, and `URL` is something to be processed; `isBusy` is local. `InProgress` and `URL` are behaviorally correlated, because they are semantically saying the same thing.
But without enough semantic information, we can use locality as a good heuristic: variables that are frequently accessed together are correlated. After that, we can merge the correlated variables and reuse the single-variable detectors.
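A crude version of this locality heuristic: slide a window over an access trace and flag variable pairs that co-occur often. The window size and threshold are made-up tuning knobs, and the trace reuses the variable names from the example above:

```python
from collections import Counter

def correlated_pairs(trace, window=3, threshold=4):
    """Count, for each pair of variables, how often one appears within
    `window` accesses after the other; pairs at or above `threshold`
    are treated as correlated."""
    counts = Counter()
    for i in range(len(trace)):
        nearby = set(trace[i + 1 : i + 1 + window])
        for other in nearby:
            if other != trace[i]:
                counts[tuple(sorted((trace[i], other)))] += 1
    return {pair for pair, n in counts.items() if n >= threshold}

trace = ["InProgress", "URL", "isBusy", "InProgress", "URL",
         "counter", "InProgress", "URL"]
# correlated_pairs(trace) flags ("InProgress", "URL") and nothing else
```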
The root-cause approach does not always work, so there is another approach aiming at the effects:
- Statically identify potential failure/error site
- Statically look for critical read
- Dynamically identify buggy interleaving
Here is a graph with the details:
Practically, this tracing can be done by static slicing etc. As for locks and barriers, we can remove some data dependences by analysis.
Overall, if any critical read reads in a value that triggers the failure at runtime, it is very likely that a bug is present.
Failure diagnosis is hard because of:
- Limited information (from user)
- Hard to reproduce (environment, interaction & input)
- Hard to interpret (what is root cause)
So, to evaluate a diagnosis approach, we have some metrics:
- Latency (how many runs are needed to identify the bug)
- Performance (how low is runtime overhead)
- Capability (how much information is available)
Compared with collecting everything, we can use sampling techniques and statistical analysis to post-process the collected information.
For example, Cooperative Bug Isolation (CBI) and Cooperative Concurrency-bug Isolation (CCI):
Here, a predicate can be viewed as a feature of the program.
So, for concurrency, a predicate can be: whether two successive accesses to a memory location were made by two distinct threads or by one thread. This won't work well for OV, but works well for AV.
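A sketch of evaluating that predicate over a recorded access trace (the `(thread_id, location)` trace format is invented for illustration):

```python
def remote_access_predicates(trace):
    """trace: list of (thread_id, location) accesses, in order.
    For each access after the first to a location, evaluate the predicate:
    was the previous access to that location made by a different thread?"""
    last_thread = {}
    predicates = []
    for thread, loc in trace:
        prev = last_thread.get(loc)
        if prev is not None:
            predicates.append((loc, prev != thread))
        last_thread[loc] = thread
    return predicates

preds = remote_access_predicates([(1, "x"), (1, "x"), (2, "x"), (1, "y")])
# preds == [("x", False), ("x", True)]
```

In statistical debugging, predicates that are true far more often in failing runs than in passing ones point at the buggy access pair; sampling just means evaluating the predicate on a random subset of accesses instead of all of them.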
Now, another question: how to do sampling?
- Location: Batch atomic blocks together
- Register-based hardware performance counters:
- Hardware events such as cache coherence protocol message might reflect the sharing behavior of memory.
- Last execution record (Leveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures):
- Only collect last few branches/cache-events right before failure
The basic idea is much like how we deal with a blue screen – reboot the computer. However, it is much finer-grained: instead of trying to restart everything, it contains the effect by rolling back to a recent execution state.
So, here is the question:
- How to go back to a former state?
- Is the recovered program state correct (or consistent)?
- How to bypass the failure during re-execution?
If we simply make whole-memory checkpoints periodically and roll back all threads, the cost might be too high. Some improvements include:
- Roll back only some important threads
- Incremental checkpoint
An even more aggressive idea is to exploit idempotent code blocks (which can be re-executed multiple times while keeping single-execution semantics): we can jump back to the start of the current idempotent block without restoring the program state at all.
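A toy rollback-recovery loop (no real checkpointing; `copy.deepcopy` stands in for a memory snapshot, and the flaky step models a failure, e.g. a bad interleaving, that does not recur on re-execution):

```python
import copy

def run_with_recovery(steps, state):
    """Checkpoint-and-retry loop: before each step, snapshot the state;
    if the step fails, roll back and re-execute that step. A real system
    would bound the number of retries."""
    i = 0
    while i < len(steps):
        checkpoint = copy.deepcopy(state)
        try:
            steps[i](state)
        except Exception:
            state.clear()
            state.update(checkpoint)   # roll back the partial effects
            continue                   # re-execute the same step
        i += 1
    return state

attempts = {"n": 0}

def flaky(state):
    """Mutates state, then fails on its first execution only."""
    attempts["n"] += 1
    state["x"] = state.get("x", 0) + 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")

final = run_with_recovery([flaky], {})
# final == {"x": 1}: the failed attempt's partial write was rolled back
```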
Also, for atomicity-violation bugs, simple re-execution works pretty well (approx. 100%), but not for order-violation ones (approx. 50%).
Prevention implies prediction: we have to take measures before the bad thing happens.
Failure prediction also involves some trade-offs:
- Being too early will lead to unnecessary performance losses
- Being too late will make failures inevitable
For anticipation, one solution is called Anticipating Invariant: For an instruction $i$, a fixed set of instructions $P$ are expected to precede it and touch the same variable from a different thread.
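A sketch of learning and checking such an invariant from traces (the `(thread, instruction, variable)` trace format and names like `init_state` are invented for illustration):

```python
def learn_predecessors(training_runs):
    """For each instruction i, learn the set P of instructions from other
    threads that touched the same variable immediately before i ran."""
    expected = {}
    for run in training_runs:
        last_access = {}   # variable -> (thread, instruction)
        for thread, instr, var in run:
            prev = last_access.get(var)
            if prev is not None and prev[0] != thread:
                expected.setdefault(instr, set()).add(prev[1])
            last_access[var] = (thread, instr)
    return expected

def violates_invariant(run, expected):
    """Flag an instruction whose cross-thread predecessor on the same
    variable is not in its learned set P; return it, or None."""
    last_access = {}
    for thread, instr, var in run:
        prev = last_access.get(var)
        if prev is not None and prev[0] != thread:
            if instr in expected and prev[1] not in expected[instr]:
                return instr
        last_access[var] = (thread, instr)
    return None

training = [[(1, "init_state", "state"), (2, "use_state", "state")]]
expected = learn_predecessors(training)
# a run where "use_state" follows "teardown" instead of "init_state"
# violates the anticipating invariant, so prevention can kick in
```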
Fix concurrency bugs
The idea is to fix these bugs automatically. Intuitively, we can add locks for atomicity violations and signal/wait for order violations. But you have to make sure you add them right:
- Do not introduce new bugs, e.g. dead-lock
- Do not hurt performance too much, e.g. too much waiting time
- Do not hurt code readability too much
To correctly add locks around two should-be-atomic blocks, we have to consider control flow: what if the entry point (where we `lock`) is not always reached but the exit point (where we `unlock`) is, or vice versa? So, a better solution would be to do some graph analysis to make sure that:
- For any possible trace, if the atomic block is involved, it should reside in the locked region
- For any possible trace, lock will be acquired and released exactly once
Now, let’s turn to the order problem: How do we enforce a certain ordering, i.e. make instruction A execute before B?
This problem becomes trickier if A is executed multiple times, because of:
- Repeated execution per thread, or
- Multiple thread instances involving A
For each A thread, we can set a signal right after the loop involving A exits. As for the master thread: when it spawns each A thread it increments a shared counter, and when an A thread exits it decrements the counter. Note that the spawning should appear atomic to thread B, so that thread B can wait until the shared counter drops back to 0.
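That scheme can be sketched with Python threads: the master increments a shared counter before spawning each A thread, each A thread decrements it on exit, and B waits on a condition variable until spawning has finished and the counter is back to 0:

```python
import threading

counter = 0
spawning_done = False
cond = threading.Condition()
events = []

def a_worker(i):
    """An A thread: do its work, then signal exit by decrementing."""
    global counter
    with cond:
        events.append(f"A{i}")
        counter -= 1
        cond.notify_all()

def master(n):
    """Increment before each spawn, so counter > 0 while any spawned A
    is still pending; spawning_done covers the window before the first
    spawn, making the whole spawning phase look atomic to B."""
    global counter, spawning_done
    workers = []
    for i in range(n):
        with cond:
            counter += 1
        t = threading.Thread(target=a_worker, args=(i,))
        t.start()
        workers.append(t)
    with cond:
        spawning_done = True
        cond.notify_all()
    for t in workers:
        t.join()

def b_thread():
    """B runs only after every A thread has finished."""
    with cond:
        while not (spawning_done and counter == 0):
            cond.wait()
        events.append("B")

b = threading.Thread(target=b_thread)
b.start()
master(3)
b.join()
# events is A0..A2 (in some order) followed by "B", on every schedule
```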
Patch testing & improving
Well, welcome to the land of magic – even if we might produce incorrect patches, it is not the end of the world, because we can test the generated patches automatically! There are several goals:
- Prune incorrect patches
- Prune slow patches
- Prune unreadable (overly complicated) patches
But complexity can be reduced further through:
- Merge patches – for example, one lock might completely cover another lock’s range, so the smaller lock is unnecessary in the final patch.
- More constructs – we used to depend only on a minimal set of primitives (lock + signal/wait), which can look terrible when layered together. Developers often use other, more comprehensive constructs, such as `join`, to solve the ordering problem – the code following `join` can only start executing after all spawned threads have finished.
- Moving constructs – similar to merging, we can enlarge the synchronized block by moving code.
Definition: Performance bugs are bugs that cause severe & unnecessary performance degradation for some inputs.
Personally, the first time I met this concept I thought: can this even be called a "bug"? I would rather call it a poorly designed algorithm or a noob implementation or something. However, I later found it sensible to some extent, because the bugs studied here share some very similar characteristics with correctness bugs, such as well-defined patterns and the possibility of automating detection and fixing.
That is to say, bubble sort is slower than quick sort, but using bubble sort is not a performance bug under this definition.
Reported performance bugs could include:
- Uncoordinated Functions
- Skippable Function
- Synchronization Issue
And in most cases, performance bugs are even harder to find and trigger, because users might not know how fast the software should be. Only in rare cases, when performance degrades severely, are users alerted.
Usually we use a profiler to find such problems, but profiling results can be too noisy and can't directly tell us what causes the problem.
Performance bug detections
There are several common patterns, such as nested loops (possibly implicit, e.g. inter-procedural).
The point about nested loops is that we repeat some work, i.e. do the same thing in each iteration. We can find such repetition by analyzing the memory trace.
However, we also have to eliminate a lot of false positives caused by:
- initializer code (not frequent in itself)
- string appending etc.
Performance bug fixing
The survey showed that developers' fixes for performance issues are often astonishingly simple – so the CARAMEL paper focuses on producing simpler fixes. For example, the patterns of wasted computation found by CARAMEL include late exit from iteration, so we can add a conditional `break` to exit earlier.
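A minimal before/after illustration of the late-exit pattern (the membership test is a stand-in for real wasted work; `count_steps` just counts how many elements each version examines):

```python
def contains_slow(items, target):
    """Late exit: keeps scanning after the answer is already known."""
    found = False
    for x in items:
        if x == target:
            found = True       # wasted iterations continue after this
    return found

def contains_fast(items, target):
    """The simple fix: a conditional early exit."""
    for x in items:
        if x == target:
            return True        # or: found = True; break
    return False

def count_steps(fn, items, target):
    """Run `fn` on a counting iterator to measure elements examined."""
    steps = {"n": 0}
    def counted():
        for x in items:
            steps["n"] += 1
            yield x
    return fn(counted(), target), steps["n"]

# count_steps(contains_slow, range(100), 5) -> (True, 100)
# count_steps(contains_fast, range(100), 5) -> (True, 6)
```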
Further, the statistical debugging approach can also be applied in this case, but I won’t detail on this here.
Distributed System Bugs
Note: I am not familiar with distributed computing, so I will try to be brief in this part.
In distributed systems powered by software such as HBase, Hadoop, ZooKeeper etc., a lot of software bugs have been found in recent years. These new problems include:
- Distributed concurrency bugs: caused by non-deterministic timing of concurrent events involving more than one node
- Non-deterministic performance bugs
- Limpware: limping hardware
- Scalability bugs
- Other outage-causing bugs, like SPOF/cascading bugs and cross-layer upgrade bugs etc.
References

- Kshemkalyani, Ajay D., and Mukesh Singhal. Distributed Computing: Principles, Algorithms, and Systems. Cambridge University Press, 2011.
- Pessanha, Vasco, Ricardo J. Dias, and Joao M. Lourenço. “Precise Detection of Atomicity Violations.” In Proc. HVC 2012, LNCS. 2012.
- Qin, Feng, Shan Lu, and Yuanyuan Zhou. “SafeMem: Exploiting ECC-memory for detecting memory leaks and memory corruption during production runs.” 11th International Symposium on High-Performance Computer Architecture. IEEE, 2005.
- Zhou, Pin, et al. “iWatcher: Efficient architectural support for software debugging.” ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society, 2004.
- Rinard, Martin C., et al. “Enhancing Server Availability and Security Through Failure-Oblivious Computing.” OSDI. Vol. 4. 2004.
- Lu, Shan, et al. “Learning from mistakes: a comprehensive study on real world concurrency bug characteristics.” ACM Sigplan Notices. Vol. 43. No. 3. ACM, 2008.
- Zhang, Wei, et al. “ConSeq: detecting concurrency bugs through sequential errors.” ACM SIGPLAN Notices. Vol. 46. No. 3. ACM, 2011.
- Nistor, Adrian, et al. “Caramel: detecting and fixing performance problems that have non-intrusive fixes.” Proceedings of the 37th International Conference on Software Engineering, Volume 1. IEEE Press, 2015.
- Leesatapornwongsa, Tanakorn, et al. “SAMC: semantic-aware model checking for fast discovery of deep bugs in cloud systems.” 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014.
- Leesatapornwongsa, Tanakorn, et al. “TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems.” Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2016.