How to generate AVR (Arduino) Assembly Control Flow Graphs with C#

In all fairness, for any Arduino (or Atmel AVR) program it is hard to fathom a more authoritative source (on its behavior) than the list of machine instructions it ultimately compiles down to.

Disassemblers for viewing the contents of .HEX files (the end result of compilation that is typically flashed through the bootloader on your Arduino) are not hard to come by. While the industry standard is definitely avr-objdump (in the avr-gcc toolchain), multiple people (including myself) wrote their own (you can find mine here).

Reading these disassembled listings can still be a laborious task (even more so when the code is generated by a compiler rather than manually assembled by a human programmer). The old adage “a picture is worth a thousand words” is true!

This blog post will focus on how to generate (basic) Control Flow Graphs, a computer science (code representation) primitive often attributed to have originated in the works of Frances E. Allen.

For this particular purpose (emitting graphs), generating our intermediate output in the DOT language makes sense. On top of the fact that .DOT files are (1) easy to generate and (2) human readable and editable, the Open Source GraphViz project libraries will make subsequent exports of these files to common formats easy (SVG, PNG, JPG, …).

Potential practical applications can be found in abundance, from static code flow analysis that detects unreachable code to (automated or semi-automated) side channel timing attack vulnerability analysis.

A quick preview of the type of output we will be generating:

Sample Output

Disclaimer:

Full accompanying source code for this blog post can be found on GitHub.
Rather than being an exhaustive solution the code provided is an initial stab at the problem. In particular, not all branching statements (e.g. IJMP) are supported.
The sources are C#, .NET Core so they should work cross-platform (Windows, Linux, Mac OSX).

Boilerplate for parsing Command Line Options

The great CommandLineParser library makes it extremely simple to declaratively define (and subsequently parse) command line options. All that is required is something along these lines:

internal class Program
{
    private static readonly Logger Logger =
        LogManager.GetLogger(AVRControlFlowGraphGenerator.LoggerName);

    private static void Main(string[] args)
    {
        Logger.Info("Parsing Command Line Arguments...");
        Parser.Default.ParseArguments<CommandLineOptions>(args)
            .WithParsed(options =>
            {
                var generationOptions = new AVRControlFlowGraphGeneratorOptions
                {
                    InputFile = options.InputFile,
                    OutputFile = options.OutputFile,
                    StartAddress = options.StartAddress,
                    DPI = options.DPI
                };

                try
                {
                    var generator =
                        new AVRControlFlowGraphGenerator(generationOptions);
                    generator.GenerateCFG();
                    Environment.Exit((int) ExitCode.Success);
                }
                catch (Exception ex)
                {
                    Logger.Error(ex, $"Unexpected runtime error: {ex.Message}!");
                    Environment.Exit((int) ExitCode.RuntimeError);
                }
            })
            .WithNotParsed(options =>
                Environment.Exit((int) ExitCode.FailedToParseCommandLineArgs));
    }
}

using CommandLine;

namespace AVRControlFlowGraphGenerator
{
    public class CommandLineOptions
    {
        [Option('i', "input", Required = true,
            HelpText = "Input file (.HEX) for which to generate a CFG.")]
        public string InputFile { get; set; }

        [Option('o', "output", Required = true, HelpText = "Output file (.DOT).")]
        public string OutputFile { get; set; }

        [Option('s', "startaddress", Required = false, Default = 0,
            HelpText = "Start Offset (defaults to 0x0).")]
        public int StartAddress { get; set; }

        [Option('d', "dpi", Required = false, Default = 200, HelpText = "DPI.")]
        public int DPI { get; set; }
    }
}

Running the tool without parameters will then automagically report on all parameters.

2017-09-06 00:29:18.0345|INFO|AVRCFGGen|Parsing Command Line Arguments...
AVRControlFlowGraphGenerator 1.0.0
Christophe Diericx

ERROR(S):
  Required option 'i, input' is missing.
  Required option 'o, output' is missing.

  -i, --input           Required. Input file (.HEX) for which to generate a CFG.

  -o, --output          Required. Output file (.DOT).

  -s, --startaddress    (Default: 0) Start Offset (defaults to 0x0).

  -d, --dpi             (Default: 200) DPI.

  --help                Display this help screen.

  --version             Display version information.

Generating the Control Flow Graph

The AVRControlFlowGraphGenerator class takes some configuration options in its constructor, and has a single public GenerateCFG method (return type void) where all the necessary work is performed.

Arguably an object-model graph should be returned here for reusability purposes (with the actual translation into a dot style graph elsewhere). But for our quick prototype purposes this will do.

public class AVRControlFlowGraphGenerator
{
    internal const string LoggerName = "AVRCFGGen";
    private static readonly Logger Logger = LogManager.GetLogger(LoggerName);
    private AVRControlFlowGraphGeneratorOptions Options { get; }
    private AssemblyStatement[] Disassembly { get; set; }
    private int HighestOffset { get; set; }
    private IDictionary<int,AssemblyStatement> AddressLookup { get; set; }
    private IList<BasicBlock> BasicBlocks { get; }
    private HashSet<Relation> Relations { get; }
    private HashSet<int> VisitedEntryPoints { get; }
    private Stack<int> CallStack { get; }

    public AVRControlFlowGraphGenerator(AVRControlFlowGraphGeneratorOptions options)
    {
        Options = options;
        AddressLookup = new Dictionary<int, AssemblyStatement>();
        BasicBlocks = new List<BasicBlock>();
        Relations = new HashSet<Relation>();
        VisitedEntryPoints = new HashSet<int>();
        CallStack = new Stack<int>(1024);
    }

    public void GenerateCFG()
    {
        ...
    }
}

1. Disassembling the HEX file

Using the AVRDisassembler library, disassembling a HEX file (into an array of AssemblyStatement instances) is a couple of lines of code at most. We will also create a lookup table that allows us to instantly retrieve an AssemblyStatement at a given logical offset in the code.

// Disassemble
Logger.Info($"Disassembling '{Options.InputFile}'...");
Disassembly = Disassemble(Options.InputFile);
Logger.Trace(
    string.Join(
        Environment.NewLine,
        new []{ "Disassembly:" }.Concat(
            Disassembly.Select(x => x.ToString()))
    ));

// Store highest address
HighestOffset = Disassembly.Last().Offset;
Logger.Debug($"Highest Offset: {HighestOffset}.");

// Create direct address lookup table
Logger.Debug("Creating address lookup table...");
AddressLookup = Disassembly.ToDictionary(stmt => stmt.Offset);

private static AssemblyStatement[] Disassemble(string file)
{
    var disAssembler = new Disassembler(
        new DisassemblerOptions
        {
            File = file, JsonOutput = false
        });
    return disAssembler.Disassemble().ToArray();
}

private AssemblyStatement FindEntry(int offset)
{
    AddressLookup.TryGetValue(offset, out var result);
    if (result == null) throw new InvalidOperationException(
        $"Unable to find statement at offset {offset:X8}!");
    return result;
}

The FindEntry method implementation showcases a few recent additions to the C# language:

The ability to declare the out variable directly (and even implicitely) in the argument list of the method call (as can be seen on the first line of FindEntry) was introduced in C# 7.
The $ style string interpolation syntax is one of the greatest recent additions in my opinion (introduced in C# 6).

2. Scanning for “Basic Blocks” and mapping “Relations”

As per the Wikipedia definition:

In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.[1][2] This restricted form makes a basic block highly amenable to analysis.[3] Compilers usually decompose programs into their basic blocks as a first step in the analysis process. Basic blocks form the vertices or nodes in a control flow graph.

In our first pass, we will do a depth-first search of the tree:

// First pass: traverse the tree recursively (depth first)
Logger.Info("First pass (traversing the tree)...");
foreach (var block in ParseBlock(Options.StartAddress))
{
    Logger.Debug($"Adding <{block}>");
    BasicBlocks.Add(block);
}

private IEnumerable<BasicBlock> ParseBlock(int startOffset)
{
    var stmt = FindEntry(startOffset);
    var currentBlock = new BasicBlock();
    Logger.Trace($"Visiting {startOffset}..");
    VisitedEntryPoints.Add(startOffset);

    while (true)
    {
        currentBlock.Add(stmt);
        if (IsBranch(stmt))
        {
            yield return currentBlock;

            var relationsToAdd = new List<Relation>();
            Logger.Debug($"Encountered {stmt.OpCode.Name}: {stmt}");

            // Unconditional Jumps
            if (stmt.OpCode is OpCodes.Branch.JMP)
            {
                relationsToAdd.Add(
                    new Relation(startOffset, stmt.Operands.ToArray()[0].Value));
            }
            else if (stmt.OpCode is OpCodes.Branch.RJMP)
            {
                var op = stmt.Operands.ToArray()[0];
                relationsToAdd.Add(
                    new Relation(startOffset, stmt.Offset + op.Value + 2));
            }
            // Conditional Jumps
            else if (stmt.OpCode is OpCodes.Branch.BRBC
                || stmt.OpCode is OpCodes.Branch.BRBS)
            {
                var op = stmt.Operands.ToArray()[1];
                relationsToAdd.Add(
                    new Relation(startOffset, stmt.Offset + op.Value + 2));
                relationsToAdd.Add(
                    new Relation(startOffset, stmt.Offset + 2));
            }
            // Call And Return
            else if (stmt.OpCode is OpCodes.Branch.CALL)
            {
                relationsToAdd.Add(
                    new Relation(startOffset, stmt.Operands.ToArray()[0].Value));
                CallStack.Push(stmt.Offset + 4);
            }
            // Return
            else if (stmt.OpCode is OpCodes.Branch.RET)
            {
                var returnTarget = CallStack.Pop();
                relationsToAdd.Add(new Relation(startOffset, returnTarget));
            }

            foreach (var relation in relationsToAdd)
            {
                var source = relation.Source;
                var target = relation.Target;
                Logger.Trace($"Adding relation {source} -> {target}");
                if (!Relations.Contains(relation)) Relations.Add(relation);
                if (VisitedEntryPoints.Contains(target)) continue;
                foreach (var block in ParseBlock(target))
                    yield return block;
            }
            break;
        }
        var nextSequentialInstruction = stmt.Offset + stmt.OriginalBytes.Count();
        if (nextSequentialInstruction > HighestOffset) break;
        stmt = FindEntry(nextSequentialInstruction);
    }
}

private static bool IsBranch(AssemblyStatement stmt)
{
    return stmt.OpCode is OpCodes.Branch.JMP || stmt.OpCode is OpCodes.Branch.RJMP
        || stmt.OpCode is OpCodes.Branch.BRBC || stmt.OpCode is OpCodes.Branch.BRBS
        || stmt.OpCode is OpCodes.Branch.CALL || stmt.OpCode is OpCodes.Branch.RET;
}

Conceptually, what happens here is relatively straightforward:

We keep adding statements to our current Basic Block until we encounter a Branch instruction:
- Any branching instruction qualifies (please note that not all branching instructions in the AVR instruction set are currently supported here).
- At that moment, we add a Relation (node edge) between the current Basic Block and the Basic Blocks implicitely started by our Jump Target Offset(s).
- If the Basic Blocks started by our Jump Target Offset(s) have not been visited yet, we (recursively) walk them (and their node edges) next.
- We use a little Stack structure in order to keep track of (possibly nested) subroutine calls (and associated returns).

This gives us a reasonable graph already, but a particular problem is still present with this particular algorithm: it is possible a jump brings us at an offset that was already a part of a Basic Block (but not coinciding with either start or end of that particular block).

In that case, we will end up with a graph that is not necessarily incorrect, but will still contain sections of code that are “duplicated”:

Sample Output

As seen above, Basic Block [010C to 0110] is fully contained within Basic Block [0108 to 0110].

In order to solve this, we want to shrink the larger Basic Block [0108 to 0110] to [0108 to 010A]. Then, we should add a relation from that newly dimensioned block to the existing Basic Block [010C to 0110].

In order to do so, we will issue a second pass of our tree:

// Second pass: find blocks that overlap, and split them
Logger.Info($"Number of Basic Blocks found: {BasicBlocks.Count()}.");
var relationsToRemove = new List<Relation>();
var relationsToAdd = new List<Relation>();
foreach (var block in BasicBlocks)
{
    var blockStart = block.FirstOffset;
    var blockEnd = block.LastOffset;
    foreach (var otherBlock in BasicBlocks.Where(x => x.Guid != block.Guid))
    {
        var otherBlockStart = otherBlock.FirstOffset;
        var otherBlockEnd = otherBlock.LastOffset;
        var overlap = blockStart < otherBlockEnd && otherBlockStart < blockEnd;
        if (!overlap) continue;
        Logger.Trace($"Overlapping blocks found: {block}-{otherBlock}.");

        /*
             *  |                  +------------+
             *  |                  | otherBlock |
             *  |   +------------+ |            |
             *  |   | block      | |            |
             *  |   +------------| +------------+
             * \|/
             *  '
             * or
             *
             *  |   +------------+
             *  |   | block      |
             *  |   |            | +------------+
             *  |   |            | | otherBlock |
             *  |   +------------| +------------+
             * \|/
             *  '
             */
        if (blockEnd == otherBlockEnd && blockStart != otherBlockStart)
        {
            var biggestBlock = blockStart > otherBlockStart ? otherBlock : block;
            var smallestBlock = blockStart > otherBlockStart ? block : otherBlock;

            // Shrink the biggest block ...
            biggestBlock.RemoveAll(x => x.Offset >= smallestBlock.FirstOffset);

            // Move relations of biggest block ...
            foreach (var otherRelation in Relations.Where(
                x => x.Source == biggestBlock.FirstOffset))
            {
                relationsToRemove.Add(otherRelation);
                relationsToAdd.Add(
                    new Relation(smallestBlock.FirstOffset, otherRelation.Target));
            }
            // ... and add a new relation
            relationsToAdd.Add(
                new Relation(biggestBlock.FirstOffset, smallestBlock.FirstOffset));
        }
        else
        {
            throw new NotImplementedException();
        }

        // Remove relations
        foreach (var relationToRemove in relationsToRemove)
            Relations.Remove(relationToRemove);

        // Add relations
        foreach (var relationToAdd in relationsToAdd)
            if (!Relations.Contains(relationToAdd))
                Relations.Add(relationToAdd);
    }
}

After issuing the second pass, the problematic snippet above will look correct:

Sample Output

3. Generating the .DOT file

Generating the graph file in the DOT language mostly consists of adding our nodes (“Basic blocks”) and node edges (“Relations”) and applying some styling.

// Generate graph
var graphBuilder = new StringBuilder();
graphBuilder.AppendLine("digraph G {");
graphBuilder.AppendLine($"    graph[dpi={Options.DPI},bgcolor=\"#333333\",fontcolor=\"white\"];");
graphBuilder.AppendLine("    node[style=filled,shape=box,fontcolor=\"white\",color =\"white\",fillcolor=\"#006699\",fontname=\"Consolas\"];");
graphBuilder.AppendLine("    edge[style=dashed,color=\"white\",arrowhead=open];");

foreach (var block in BasicBlocks)
    graphBuilder.AppendLine(
        $"    \"{FormatBlock(block.FirstOffset)}\" "
        + $"[label=\"{FormatContents(block.ToArray())}\"];");

foreach (var relation in Relations)
    graphBuilder.AppendLine(FormatRelation(relation));

graphBuilder.AppendLine("}");
var graph = graphBuilder.ToString();
File.WriteAllText(Options.OutputFile, graph);

private static string FormatRelation(Relation relation)
{
    return $"    \"{FormatBlock(relation.Source)}\" -> \"{FormatBlock(relation.Target)}\";";
}

private static string FormatBlock(int startOffset)
{
    return $"Block_{startOffset:X8}";
}

private static string FormatContents(IEnumerable<AssemblyStatement> statements)
{
    return string.Join(@"\l",
        statements.Select(x => x.ToString()).Concat(new[] { string.Empty }));
}

Once the .DOT file is generated, we can convert it to another format (e.g. PNG, SVG, …) using the dot command line tool (part of the GraphViz installation, available on many platforms):

dot MyFile.dot -Tpng -o MyFile.png

Sample Output

Here is some sample output (based on the Blink program in the Arduino code example section):

Zoomed out rendered graph:

Sample Output