
February 04, 2015

Symbolic execution in vuln research

There is no serious disagreement that symbolic execution has a remarkable potential for programmatically detecting broad classes of security vulnerabilities in modern software. Fuzzing, in comparison, is an extremely crude tool: it's the banging-two-rocks-together way of doing business, as contrasted with brain surgery.

Because of this, it comes as no surprise that for the past decade or so, the topic of symbolic execution and related techniques has been the mainstay of almost every single self-respecting security conference around the globe. The tone of such presentations is often lofty: the slides and research papers are frequently accompanied by claims of extraordinary results and the proclamations of the imminent demise of less sophisticated tools.

Yet, despite the crippling and obvious limitations of fuzzing and the virtues of symbolic execution, there is one jarring discord: I'm fairly certain that around 70% of all remote code execution vulnerabilities disclosed in the past few years trace back to fairly "dumb" fuzzing tools, with the pattern showing little change over time. The remaining 30% is attributable almost exclusively to manual work - be it systematic code reviews, or just aimlessly poking the application in hopes of seeing it come apart. When you dig through public bug trackers, vendor advisories, and CVE assignments, the mark left by symbolic execution can be seen only with a magnifying glass.

This is an odd discrepancy, and one that is sometimes blamed on the practitioners being backward, stubborn, and ignorant. This may be true, but only to a very limited extent; ultimately, most geeks are quick to embrace the tools that serve them well. I think that the disconnect has its roots elsewhere:
  1. The code behind many of the most-cited, seminal publications on security-themed symbolic execution remains non-public; this is particularly true for Mayhem and SAGE. Implementation secrecy is fairly atypical in the security community, is usually viewed with distrust, and makes it difficult to independently evaluate, replicate, or build on top of the published results.

  2. The research often fails to fully acknowledge the limitations of the underlying methods - while seemingly being designed to work around these flaws. For example, the famed Mayhem experiment helped identify thousands of bugs, but most of them seemed to be remarkably trivial and affected only very obscure, seldom-used software packages with no significance to security. It is likely that the framework struggled with more practical issues in higher-value targets - a prospect that, especially if not addressed head-on, can lead to cynical responses and discourage further research.

  3. Any published comparisons to more established vulnerability-hunting techniques are almost always retrospective; for example, after the discovery of Heartbleed, several teams have claimed that their tools would have found the bug. But analyses that look at ways to reach an already-known fault condition are very susceptible to cognitive bias. Perhaps more importantly, it is always tempting to ask why the tools are not tasked with producing a steady stream of similarly high-impact, headline-grabbing bugs.

The uses of symbolic execution, concolic execution, static analysis, and other emerging technologies to spot substantial vulnerabilities in complex, unstructured, and non-annotated code are still in their infancy. The techniques suffer from many performance trade-offs and failure modes, and while there is no doubt that they will shape the future of infosec, thoughtful introspection will probably get us there sooner than bold claims with little or no follow-through. We need to move toward open-source frameworks, verifiable results, and solutions that work effortlessly and reliably for everyone, against almost any target. That's the domain where the traditional tools truly shine, and that's why they scale so well.

Ultimately, the key to winning the hearts and minds of practitioners is very simple: you need to show them how the proposed approach finds new, interesting bugs in the software they care about.


  1. Solver-based testcase generation methods seem to spend a lot of CPU time doing things that aren't testing, and as you say there are plenty of unpublicized gotchas. I've also seen -- in papers about symbolic testcase generation -- some extremely shoddy arguments about why the methods are better than testing. On the other hand, I'd love to see a symbolic add-on package for afl-fuzz that devotes (for example) 20% of total execution time to symbolic testcase generation.

  2. Agree with all three points. I had a similar experience while working on an open-source symbolic execution engine, PathGrind (available at

  3. One of the major problems with symbolic/concolic execution based tools is that they suffer from the state (or path) explosion problem. For example, a program with a simple loop of 20 iterations, each containing one input-dependent branch, can lead to 2^20 possible execution paths, which can easily eat up memory and defeat the whole purpose. To address this, people do targeted symbolic/concolic execution, where the execution is guided towards a "place of interest" and the rest of the uninteresting paths are pruned. Here, a place of interest can be a potentially vulnerable piece of code or a crash location.

    The obvious limitation of targeted symbolic execution is that you need a priori knowledge of the bug/vulnerability you are looking for. In most cases, in a pre-processing step, some kind of fuzzing or dynamic analysis coupled with taint analysis is used to gain this a priori knowledge. Thus, it's quite difficult to identify 0-day bugs using just symbolic execution, and you mostly end up catching copy-and-paste, code reuse, and simple known "template" bugs.
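    The blow-up described above is easy to model in a few lines. This is a toy sketch, not a real engine (all names are illustrative): a symbolic executor forks one execution state per input-dependent branch, so a loop that tests one symbolic condition per iteration yields 2^n live states after n iterations.

```python
# Toy model of path explosion in symbolic execution (illustrative only).
# Each live state forks into a "taken" and a "not taken" successor at
# every input-dependent branch, so the frontier doubles per iteration.

def symbolic_paths(n_iterations):
    """Enumerate every sequence of branch outcomes for a loop with one
    symbolic condition per iteration."""
    frontier = [()]  # one initial state with an empty path condition
    for _ in range(n_iterations):
        frontier = [path + (taken,)
                    for path in frontier
                    for taken in (False, True)]
    return frontier

# 10 iterations already produce 2^10 = 1024 distinct states; at 20
# iterations the frontier holds ~1M states, which is why unguided
# exploration exhausts memory so quickly.
assert len(symbolic_paths(10)) == 2 ** 10
```

Targeted execution attacks exactly this frontier: instead of keeping every forked state alive, it prunes the ones that cannot reach the place of interest.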

  4. > Yet, despite the crippling and obvious limitations of fuzzing and the virtues of symbolic execution, there is one jarring discord: probably around 70% of all remote code execution vulnerabilities disclosed in the past few years still trace back to fairly "dumb" fuzzing tools, with no signs of stopping. The remaining 30% is attributable almost exclusively to manual work - be it systematic code reviews, or just aimlessly poking the application in hopes of seeing it come apart. The mark left by symbolic execution can be seen only with a magnifying glass.

    Do you have a citation for those figures? I am indeed jarred by a discord, the one between the statement I quoted and Microsoft's public statements on the matter: "Since 2007, SAGE has discovered many security-related bugs in many large Microsoft applications, including image processors, media players, file decoders, and document parsers. SAGE found approximately one-third of all the bugs discovered by file fuzzing during the development of Microsoft’s Windows 7. Because SAGE is typically run last, those bugs were missed by everything else, including static program analysis and blackbox fuzzing."

    1. I don't think that anyone has seriously studied this, but it suffices to scroll through the security bulletins for Firefox and Chrome to get a sense of the trend - say, . Or, indeed, just talk to the reporters usually credited in these notes or in MSRC bulletins. Or the security engineers working on Chrome or Firefox.

      As for SAGE: it is non-public; so are the vulnerabilities it finds, and the in-house fuzzers / code review processes it is benchmarked against. It is hard to evaluate the underlying data, but perhaps more importantly, the claims are somewhat moot: they are not applicable to any work happening outside of Microsoft. The only "hard" piece of data is that dumb fuzzers still fare pretty well with many Microsoft products and tools. Which means... about nothing, really. To form meaningful opinions, you sort of need to see the tool in action, or the bugs it spews out.

    2. To add to this point, there seem to be severe performance issues in SAGE, and Microsoft is currently working on a technique called "micro execution" (paper presented at ICSE 2014) to improve SAGE's performance.
      This is an excerpt from their micro execution paper.

      "Every micro execution performed with MicroX is fast in absolute terms (always less than 1 second), as it involves a roughly 10x slow down compared to uninstrumented native x86 execution (see [3]). In contrast, every symbolic execution performed by SAGE involves a roughly 1000x slow down compared to uninstrumented x86 execution (see [20]), and also requires constraint solving and new test generation, which can take seconds or sometimes tens of seconds in some of these experiments"

    3. Michal, you appear to have chosen two specific programs, linked to a list of bugs manually discovered by vulnerability researchers, excluded all other bugs from consideration, and then made the argument that automated tools are deficient for not having discovered them. That formal verification was not brought to bear in the discovery of these particular vulnerabilities is not evidence that these bugs are beyond the reach of these tools. Are you making a claim about technical capability? Or lack of public availability (despite the existence of publicly-available symbolic execution tools)? Or is your claim based on "probably ... [some specific percentage]" of bugs rather than data that can be analyzed? Is your claim about all bugs, all exploitable bugs, all bugs in a set of programs that you choose, all exploitable bugs in that set of programs, some specific set of bugs in that set of programs, or what? Whatever your claim is, it requires substantiating evidence that can be examined by the rest of us.

      Maggie, I am not sure what your point is. That SAGE is not the end of the line in symbolic execution technology and that improvements remain to be made in that realm? That's a trivial consequence of Rice's theorem. Symbolic execution's limitations are well understood by the formal verification community, namely that it trades non-approximative answers for incompleteness and NP-complete-or-beyond decision problems. You will see papers on improving symbolic execution until you die. What is your point?

    4. Rolf: I am not sure I follow your critique; I don't think this is really the argument I'm trying to make? If it comes across as such, then I think I must have misspoken, but I can't really see where.

      Now, I will give you that the specific percentage I give isn't particularly well substantiated to folks who aren't following OSS vulnerability research on a daily basis - although I suspect that it isn't particularly controversial for those who do; I suggest chatting with the prolific practitioners usually credited in vendor bulletins from Microsoft, Apple, Google, Mozilla, and so on; or the security teams behind these products. I'm sure they will offer similar perspectives - and ultimately, they are the ones you want on board for more revolutionary tools if you want to make an impact and make the Internet a safer place.

      If you are looking for something more objective: there are high hundreds of remotely exploitable vulnerabilities disclosed every year in critical components of the online ecosystem (web browsers, common network services, key libraries, etc. - a fairly specific subset of "interesting" platforms can be found here); they are sort-of centrally traceable through CVE IDs. The data is easy to pore over to see how many of the reports trace back to which tools and which research groups. It offers a fairly simple large-scale benchmark, too.

      And yup, I mean remotely exploitable vulnerabilities, not just bugs. Identifying software defects in general is decidedly outside the scope of this post and follows very different rules. Some of the most radical and catchy claims surrounding symbolic execution are made specifically around security flaws; bugs that aren't exploitable are difficult to classify as security bugs.

      Of course, the results may be different if you factor in claims about non-public work; for example, SAGE is claimed to have found hundreds of bugs. But that doesn't seem like a particularly meaningful exercise: non-public bug reports and tools are inscrutable and do not have a direct impact on the outside world.

      That said, let me try to refocus the discussion: the argument I hoped to make does come down to the plea made in the last line of the post. Let's take browsers and other critical components mentioned earlier on as an example: virtually all the vulnerability research and proactive work is done by a well-defined group of practitioners. One Michal Zalewski says that they don't use symbolic execution much. If you don't accept this claim, that's fair; if you do, the question is, how do we convince them to use the techniques seen as radically better than fuzzing by a good chunk of the academic research community? And I very, very strongly suspect that the answer is just "by showing them cool, new bugs".

    5. @Rolf: My point is that symbolic execution is not widely adopted by security researchers outside academia, due to its performance issues. In academia, it's a hot topic, and several groups (e.g., MS, CMU) are actively working on improving the performance of symbolic execution. However, the results they present in their papers are not very convincing in terms of usefulness for hunting exploitable 0-day bugs in its present state. What I often see are claims like "our tool can detect several bugs (which are often already reported) with an x-fold improvement in performance"; however, to achieve some of these results, they sometimes rely on unrealistic claims - that a test suite is available for the target binary, that the potential vulnerable point is already known, etc.

      These are just my observations. However, I agree that things will improve over time, and I hope to see lots of security researchers use symbolic execution to hunt security-critical bugs.

    6. typo in my previous comment: *unrealistic assumptions

    7. Michal, OK, fair enough: how do we foster collaboration between academics and industry to leverage the capabilities of both groups towards hardening critical targets? It's been unfortunate to watch this situation unfold, as I think that there are a lot of bright and talented people in both camps motivated by the common goal of improving end-user security, but that these groups have failed to connect. Part of it is the learning curve; formal verification and program analysis are difficult, as is writing exploits for modern targets -- and we all have day jobs. Part of it is cultural: academics' careers hinge upon academic trends, leaving them unmotivated and unable to keep their ear to the ground in low-level security; industrial researchers have largely eschewed academic security work on grounds of utility, availability, skepticism of overbroad claims, and confusion surrounding the underlying technology, as understanding it requires specialized education. Another part is technological; for one example, academic researchers are not "developers" and often create only prototypes; for another, BitBlaze and BAP are written in OCaml, which is not widely used outside of academia, and which makes it more difficult for industrial types to adapt and extend (though BAP has made strides toward creating useful foreign function interfaces).

      I think that ultimately, we are all participating in research that is still unfolding. I think Microsoft Research has been a great model for how to conduct a collaborative effort in this area, having combined top-notch academic talent within MSR, top-notch security talent from MSRC, and top-notch development talent to produce SAGE and other projects. Perhaps an endeavor like this is within Google's organizational ken and worthy of its attention (given its willingness to engage in initiatives such as Project Zero). It would be great to see bright academics with no publish-or-perish motive be able to work side-by-side with vulnerability researchers to hash out prototypes of things that the hackers would find useful, where developers could then refine them into tools that people actually used. It would be a lovely vision indeed for Google to become engaged in funding the development of automated vulnerability analysis tools for community and research use (weighing input from both camps equally), and creating tutorial material as a runway for each side to approach the other. The future lies at the convergence of the language and the tools.

      (Though I had no intention of mentioning it, given the contextual appropriateness, my part of the solution to the cross-cultural collaboration issue is that I have completed and now teach a training course vernacularizing SMT-based program analysis for non-academic types: )

  5. Solvers can be useful if used in the right way; they are more like a sniper rifle than a shotgun. E.g., I've used them to find a collision in a weak proprietary hash algorithm that spit out 32-bit hashes. It was way faster than analyzing the algorithm manually: 60 seconds of coding plus setting up assertions, and the solver found a collision after a few seconds of running on a desktop machine. My colleague later tried brute force out of curiosity, but that did not yield any results even after some hours.

    I started working on extending afl with a solver - not to find bugs directly, but to help discover additional paths. As AFL already knows about possible execution paths inside the program, one could imagine using concolic execution to find new paths. I am pretty sure this will help discover paths that are not easily discoverable by brute force.
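    The concolic step being described can be sketched in miniature. Everything below is illustrative and made up for the example: a real engine would record path constraints from an instrumented run and hand the negated or targeted constraint to an SMT solver, whereas here the branch condition happens to be linear modular arithmetic, so a modular inverse stands in for the solver.

```python
# Miniature concolic-execution step (illustrative sketch, not AFL code).
# The target branch guards a condition a dumb fuzzer hits with
# probability ~1/256; instead of mutating blindly, we solve the recorded
# branch condition for an input that takes the rare path.

def target(x):
    # Branch condition observed during a concrete run: (x * 7) % 256 == 42.
    if (x * 7) % 256 == 42:
        return "rare path"
    return "common path"

def solve_branch(mult, mod, want):
    """Solve (x * mult) % mod == want for x. pow(mult, -1, mod)
    computes the modular inverse (Python 3.8+), standing in for an
    SMT solver on this simple constraint."""
    return (want * pow(mult, -1, mod)) % mod

x = solve_branch(7, 256, 42)
assert target(x) == "rare path"  # solver-derived input reaches the branch
```

The appeal of hybrid fuzzing is exactly this division of labor: cheap random mutation covers the easy branches, and the solver is reserved for the narrow comparisons that random inputs essentially never satisfy.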

  6. I find it surprising you haven't mentioned KLEE: it has been actively developed and open-source for quite a while.
    To tackle path explosion in symbolic execution, selective symbolic execution has been proposed; have a look at S2E, which is based on KLEE.

    1. KLEE is interesting. I didn't mention it because at least as far as I recall, it does not make particularly radical security claims (although it is featured in third-party papers at a bunch of security conferences).

      It is also a fairly good illustration of the limitations of symbolic execution; one of the more significant claims made for KLEE is that it gets good block coverage in coreutils, which are a collection of generally *exceedingly* simple tools (touch, rm, echo, ln, etc). That's a pretty good stepping stone, but it's a *very* far cry from the complexity of the software you usually worry about (say, libpng, libxml, libjpeg-turbo, ffmpeg, spidermonkey, etc). I think we'll get there, but it will take a lot of work.

      Somewhat frustratingly, even the paper that touts the success with coreutils seems to be far more hyperbolic than it really needs to be. When you run a fuzzer against almost any decades-old unix tool, you will probably uncover dozens of fault conditions in paths that are unlikely to ever be followed when the tool is actually used in the intended way. But instead of comparing to that benchmark, the paper takes an odd turn:

      "The 90-hour run used to generate its COREUTILS coverage beat the developer’s own test suite built incrementally over fifteen years by over 14%!"

      This statement is then repeated in the conclusions of the paper. This seems silly: the coreutils test suite certainly doesn't seem to be designed to be exhaustive. It doesn't even cover all the tools, and for the ones that are covered, it doesn't test every documented command-line flag. The tests seem geared just toward spot-checking for likely mishaps and historical bugs.

      On top of that, the cumulative time that went into writing these simple tests (or finding the underlying corner cases) is certainly not fifteen years, so comparing it to 90 hours of running KLEE (+ unspecified setup time) seems meaningless. I would be surprised if the cumulative time spent on building the test suite is more than a couple of days. You can't even make the argument that KLEE is an alternative to writing the test suite by hand: the suite checks whether the tools behave in accordance with what the authors expect, which is a separate challenge to solve.

      I'm saying all this not to nitpick, but because it sort of follows the pattern also seen in security research around symbolic execution. The results are very promising, but they are very early-stage. We need to work on making them more reliable, on understanding the practical bottlenecks, and on setting up benchmarks that actually matter and can convince more people to embrace new tools.

  7. (I tried to post this earlier, but the comment apparently got lost. Apologies if this becomes a double post.)

    This is a subject very close to my heart. I recently finished a PhD which was originally intended to be on automatic exploit generation. I found pretty early on that the tools you need to do this don't exist, so my research became focused on building the basic techniques you need for automating security analysis.

    My perspective on why symbolic execution isn't currently very useful for exploit finding is twofold:

    First, symbolic execution is part of the wider software verification community. Historically, this community has mainly been interested in proving correctness, i.e., building analysers that will never claim that a buggy program is correct. In order to achieve this, they're willing to accept a certain rate of false alarms -- they'll often say that a program might have a bug, even if it doesn't. These techniques are fairly non-constructive, so they can't give you a crashing input if there is a bug. Also, the community is often interested in analysing idealised machines and so ignores things like integer overflows. Unfortunately, these properties are the opposite of what you want when doing security analysis: if my analyser tells me there's an exploit in OpenSSH, I want to be damn sure that there is, and I want a crashing trace to prove it. There is a family of symbolic analysers that doesn't generate false alarms (e.g. KLEE, Mayhem, CBMC), but these are limited in the bugs they can find. KLEE et al. explore one execution path at a time, which means they're not much more efficient than fuzzing; they can find deep bugs, but will miss lots of things, especially some of the trickier errors. CBMC and other bounded model checkers explore every path simultaneously, so they can find very complex bugs, but can only deal with a few iterations of loops. My research has been on addressing some of these issues: I've been building tools that don't give false alarms and that can deal with deep bugs in C programs. If you're interested, my thesis is here: It's obviously pretty inaccessible and technical, but lolphds I guess. All my tools are open source, but are not production quality, which brings me to my next point.
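    The false-alarm trade-off above can be shown with a tiny sketch (illustrative only, not modeled on any specific tool): a path-insensitive analyser merges the facts from both sides of a branch, loses the correlation with the guarding flag, and so reports an out-of-bounds read that no concrete run can trigger.

```python
# Why over-approximating analyses raise false alarms (toy sketch).

def program(flag):
    i = 7 if flag else 0
    buf = [10, 20, 30]       # len(buf) == 3
    if not flag:
        return buf[i]        # reached only when flag is False, i.e. i == 0
    return -1

def abstract_check():
    """Crude path-insensitive abstraction: after the branches merge,
    the analyser only knows i is in {0, 7} and has forgotten that
    i == 7 implies the indexing line is never reached."""
    possible_i = {0, 7}      # join of both branches, `flag` forgotten
    buf_len = 3
    return any(i >= buf_len for i in possible_i)

# No concrete execution can crash...
assert program(True) == -1
assert program(False) == 10
# ...yet the sound-but-imprecise check must report a possible
# out-of-bounds read: a false alarm.
assert abstract_check()
```

A verification tool biased toward soundness has to keep that alarm; a security researcher hunting exploitable bugs would rather the tool stay quiet until it can produce a concrete crashing input.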

    There's a cultural issue preventing these tools being useful for security analysis. As an academic you are actively disincentivised from building robust tools. This is because you are evaluated only on the number of papers you write. If I spend 6 months productionising one of my research tools, that's at least one paper I haven't written. The productionising work is completely invisible to the system evaluating me and so to all intents and purposes I've done nothing for 6 months. These tools are very complex and most academics are very bad engineers, so it is unsurprising that our tools rarely work robustly. As a result, most people will refuse to let others even see their code out of embarrassment.

    As a result of both these things, academic tools aren't very useful for exploit finding and so the security guys have had no incentive to pick them up and improve them. That said, I do believe that we're on the verge of an explosion in this area: the first problem is just technical & solvable, the second problem will be solved as soon as there's a startup or similar doing this stuff rather than academia, since the incentives will be very different for them. I think we'll then see a whole raft of whacky 0days & a real rise of the robots ;-)

    1. Re: the startup route - possibly. I think it's been tried a fair number of times. As one example, the guys behind Mayhem (including David Brumley) have an oddly-named startup to commercialize the idea ( Maybe it will take off?

      Ultimately, though, it is a perilous route: if you want to sell the tools to enterprises, you need them to be able to deliver at least somewhat practical results. It is remarkably tough to do that with static analysis when dealing with ad-hoc code and without a scientist on site to fine-tune the tools; so if your experiments so far have been designed chiefly to make for an impressive paper, the start-up may face some serious challenges down the line.

      Of course, we do have a number of successful companies that have been selling static analysis tools to enterprises. But in some regards, static analysis makes it simpler to deliver a baseline of decent results; heck, compilers were warning people about undefined or error-prone syntax patterns, and we had security-oriented linters, long before static analysis became a field of commercial enterprise.

      (And 1-2 decades later, the impact of smarter-than-a-compiler static analysis on the security of unconstrained code isn't entirely clear-cut, and its adoption rate in the commercial or open-source world is pretty low.)