This is a personal blog. My other stuff: book | home page | Twitter | prepping | CNC robotics | electronics

August 08, 2014

Binary fuzzing strategies: what works, what doesn't

Successful fuzzers live and die by their fuzzing strategies. If the changes made to the input file are too conservative, the fuzzer will achieve very limited coverage. If the tweaks are too aggressive, they will cause most inputs to fail parsing at a very early stage, wasting CPU cycles and spewing out messy test cases that are difficult to investigate and troubleshoot.

Designing the mutation engine for a new fuzzer has more to do with art than science. But one of the interesting side effects of the design of american fuzzy lop is that it provides a rare feedback loop: you can carefully measure what types of changes to the input file actually result in the discovery of new branches in the code, and which ones just waste your time or money.

This data is particularly easy to read because the fuzzer also approaches every new input file by going through a series of progressively more complex, but exhaustive and deterministic fuzzing strategies - say, sequential bit flips and simple arithmetics - before diving into purely random behaviors. The reason for this is the desire to generate the simplest and most elegant test cases first; but the design also provides a very good way to quantify how much value each new strategy brings in to the table - and whether we need it at all.

The measurements of afl fuzzing efficiency for reasonably-sized test cases are remarkably consistent across a variety of real-world binary formats - anything ranging from image files (JPEG, PNG, GIF, WebP) to archives (gzip, xz, tar) - and because of this, I figured that sharing the data more broadly will be useful to folks who are working on fuzzers of their own. So, let's dive in:
  • Walking bit flips: the first and most rudimentary strategy employed by afl involves performing sequential, ordered bit flips. The stepover is always one bit; the number of bits flipped in a row varies from one to four. Across a large and diverse corpus of input files, the observed yields are:

    • Flipping a single bit: ~70 new paths per one million generated inputs,
    • Flipping two bits in a row: ~20 additional paths per million generated inputs,
    • Flipping four bits in a row: ~10 additional paths per million inputs.

    (Note that the counts for every subsequent pass include only the paths that could not have been discovered by the preceding strategy.)

    Of course, the strategy is relatively expensive, with each pass requiring eight execve() per every byte of the input file. With the returns are diminishing rapidly, afl stops after these three passes - and switches to a second, less expensive strategy past that point.

  • Walking byte flips: a natural extension of walking bit flip approach, this method relies on 8-, 16-, or 32-bit wide bitflips with a constant stepover of one byte. This strategy discovers around ~30 additional paths per million inputs, on top of what could have been triggered with shorter bit flips.

    It should be fairly obvious that each pass takes approximately one execve() call per one byte of the input file, making it surprisingly cheap, but also limiting its potential yields in absolute terms.

  • Simple arithmetics: to trigger more complex conditions in a deterministic fashion, the third stage employed by afl attempts to subtly increment or decrement existing integer values in the input file; this is done with a stepover of one byte. The experimentally chosen range for the operation is -35 to +35; past these bounds, fuzzing yields drop dramatically. In particular, the popular option of sequentially trying every single value for each byte (equivalent to arithmetics in the range of -128 to +127) helps very little and is skipped by afl.

    When it comes to the implementation, the stage consists of three separate operations. First, the fuzzer attempts to perform subtraction and addition on individual bytes. With this out of the way, the second pass involves looking at 16-bit values, using both endians - but incrementing or decrementing them only if the operation would have also affected the most significant byte (otherwise, the operation would simply duplicate the results of the 8-bit pass). The final stage follows the same logic, but for 32-bit integers.

    The yields for this method vary depending on the format - ranging from ~2 additional paths per million in JPEG to ~8 per million in xz. The cost is relatively high, averaging around 20 execve() calls per one byte of the input file - but can be significantly improved with only a modest impact on path coverage by sticking to +/- 16.

  • Known integers: the last deterministic approach employed by afl relies on a hardcoded set of integers chosen for their demonstrably elevated likelihood of triggering edge conditions in typical code (e.g., -1, 256, 1024, MAX_INT-1, MAX_INT). The fuzzer uses a stepover of one byte to sequentially overwrite existing data in the input file with one of the approximately two dozen "interesting" values, using both endians (the writes are 8-, 16-, and 32-bit wide).

    The yields for this stage are between 2 and 5 additional paths per one million tries; the average cost is roughly 30 execve() calls per one byte of input file.

  • Stacked tweaks: with deterministic strategies exhausted for a particular input file, the fuzzer continues with a never-ending loop of randomized operations that consist of a stacked sequence of:

    • Single-bit flips,
    • Attempts to set "interesting" bytes, words, or dwords (both endians),
    • Addition or subtraction of small integers to bytes, words, or dwords (both endians),
    • Completely random single-byte sets,
    • Block deletion,
    • Block duplication via overwrite or insertion,
    • Block memset.

    Based on a fair amount of testing, the optimal execution path yields appear to be achieved when the probability of each operation is roughly the same; the number of stacked operations is chosen as a power-of-two between 1 and 64; and the block size for block operations is capped at around 1 kB.

    The absolute yield for this stage is typically comparable or higher than the total number of execution paths discovered by all deterministic stages earlier on.

  • Test case splicing: this is a last-resort strategy that involves taking two distinct input files from the queue that differ in at least two locations; and splicing them at a random location in the middle before sending this transient input file through a short run of the "stacked tweaks" algorithm. This strategy usually discovers around 20% additional execution paths that are unlikely to trigger using the previous operation alone.

    (Of course, this method requires a good, varied corpus of input files to begin with; afl generates one automatically, but for other tools, you may have to construct it manually.)

As you can see, deterministic block operations (duplication, splicing) are not attempted in an exhaustive fashion; this is because they generally require quadratic time (or worse) - so while their yields may be good for very short inputs, they degrade very quickly.

Well, that's it! If you ever decide to try out afl, you can watch these and other cool stats on your screen in real time.


  1. was wondering if you could clarify a couple of things about the mutation techniques. thank you and appreciate your help.

    in regards to the "Known integers”, "Simple arithmetics”, "Stacked tweaks” specifically "Block deletion, Block duplication via overwrite or insertion, Block memset”; i was wondering do your fuzzing technique depend on grammar based fuzzing like you identify the interesting ints, and blocks from the targeted file spec or you mutate the file at random locations??

  2. This comment has been removed by the author.

  3. BTW you could output these perf results for fuzzing to a file in a nicely laid-out format and include a script that can be used to upload to e.g. your server. Worst case, people get nicely laid-out data about which was more useful. Best case, you get anonymous feedback.