HexaMonkey - An open-source binary file analyzer

HexaMonkey Architecture

General Architecture

Architecture Overview

The compiler

The compiler converts HMDL text files (.hm) into compiled binary files (.hmc), using Flex for lexing and Bison for parsing. Its output uses a special format that implements the EBML specification. It must also provide the core with the model.h file, which contains information on the HMDL syntax model.

The core

The main goal of the core is to extract a structured data tree from a binary file, using Modules.

The main class in the core is the ModuleLoader, which assigns a Module to each File in order to parse it. Modules are responsible for generating the Parsers that decompose a file into a tree structure whose nodes are instances of the class Object. Modules can be imported into other modules, so that generic parsers (for integers or strings, for example) can be shared by several Modules.

A Module can be either native or HMDL. Native modules are subclasses of the virtual Module class. They include the default module, which is imported automatically by every module and provides basic structures such as the basic file and the array, and the standard module, which provides basic atoms such as integers, floats and strings. The standard module exists in two versions: the little-endian version (codename lestd) and the big-endian version (codename). To create a native module, the most practical class to reimplement is the MapModule. However, it is recommended, when possible, to write a new module in HMDL.

1. Loading native modules

The first thing to do is to register the mandatory modules with the ModuleLoader. We give it the native modules that will be used for later parsing.

ModuleLoader moduleLoader;
moduleLoader.addModule("hmc",   new HmcModule(getFile(modelsDirs, "hmcmodel.csv")));
moduleLoader.addModule("ebml",  new EbmlModule);
moduleLoader.addModule("mymodule", new MyModule(true));
...

By doing this, the ModuleLoader keeps a map linking the string "mymodule" to the MyModule instance.

2. Accessing modules

Later on, a module can be accessed using:

moduleLoader.getModule("mymodule");

This returns the MyModule object, and also imports into it every module that MyModule depends on.
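The registration and lookup steps above can be pictured with a small, self-contained sketch. It is not the actual ModuleLoader: SketchModule, SketchLoader and the dependencies and imported fields are hypothetical names, and resolving dependencies on first access is an assumption about how the import could work.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a module: it only lists the keys it depends on.
struct SketchModule {
    std::vector<std::string> dependencies;      // keys of the modules to import
    std::vector<const SketchModule*> imported;  // filled on first access
    bool loaded = false;
};

// Hypothetical stand-in for ModuleLoader: a map from string keys to modules.
class SketchLoader {
public:
    void addModule(const std::string& key, std::unique_ptr<SketchModule> module) {
        assert(module != nullptr);
        modules_[key] = std::move(module);
    }

    // Returns the module registered under `key`; on first access, its
    // dependencies are resolved recursively and imported into it.
    const SketchModule& getModule(const std::string& key) {
        auto it = modules_.find(key);
        if (it == modules_.end()) {
            throw std::out_of_range("unknown module: " + key);
        }
        SketchModule& module = *it->second;
        if (!module.loaded) {
            module.loaded = true;  // set before recursing so that cycles terminate
            for (const std::string& dependency : module.dependencies) {
                module.imported.push_back(&getModule(dependency));
            }
        }
        return module;
    }

private:
    std::map<std::string, std::unique_ptr<SketchModule>> modules_;
};
```

In this sketch, asking for "mymodule" after registering it with a dependency on another key leaves one entry in its imported list, pointing at the module stored under that key.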

One can also get the appropriate module for a file:

const Module& module = moduleLoader.getModule(file);

Internal: each module in the moduleLoader can contain one or more FormatDetectors, each of which looks for a specific file extension, a magic number (a byte sequence at the beginning of the file) or a syncbyte (a periodic occurrence of a byte sequence in the file, e.g. 'G' in MPEG-TS). The first module with a matching format detector is returned.

In theory, this would be enough to interpret data if a native Module were implemented for each file type (png, gif, etc.). However, most file types are described using HMDL instead (see part 4).
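The three detection strategies can be sketched as plain functions. This is an illustration only, not the FormatDetector API: the function names, the Bytes alias and the period and repeats parameters are hypothetical. The MPEG-TS values used in the usage note (sync byte 0x47, i.e. 'G', every 188 bytes) are the standard packet layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

// True if `path` ends with "." followed by `extension`.
bool matchesExtension(const std::string& path, const std::string& extension) {
    const std::string suffix = "." + extension;
    return path.size() >= suffix.size() &&
           path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// True if `data` starts with the byte sequence `magic`.
bool matchesMagicNumber(const Bytes& data, const Bytes& magic) {
    return data.size() >= magic.size() &&
           std::equal(magic.begin(), magic.end(), data.begin());
}

// True if `syncByte` is found at offsets 0, period, 2 * period, ... for
// `repeats` consecutive positions.
bool matchesSyncByte(const Bytes& data, unsigned char syncByte,
                     std::size_t period, std::size_t repeats) {
    assert(period > 0);
    for (std::size_t i = 0; i < repeats; ++i) {
        const std::size_t offset = i * period;
        if (offset >= data.size() || data[offset] != syncByte) {
            return false;
        }
    }
    return true;
}
```

For an MPEG-TS candidate one would call matchesSyncByte(data, 0x47, 188, n) for some small n; checking several consecutive packets rather than one reduces false positives, since a single 'G' byte can appear anywhere.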

3. Parsing a data file

Once the module corresponding to a file has been loaded, the file may be parsed using:

Object* obj = module.handle(defaultTypes::file, file);

Internal: here, the new Object is given all the Parsers necessary to parse itself, but it has not been parsed yet. An Object is a tree structure: it contains a list of pointers to other Objects (its child nodes). Since it is not parsed yet, this list is initially empty.

Once all this has been done, one may extract data from the Object using either access, lookingForType or lookUp. For instance:

object.lookUp(child_name, forceParse);

This call looks in the already available children and returns the child named child_name, if found. If it is not found, the boolean forceParse determines whether the Object should continue parsing itself until child_name is found or simply return a null pointer.
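The lazy behaviour of lookUp can be illustrated with a reduced model. SketchObject is a hypothetical class, not the core's Object: its "parsers" are just a list of pending child names, and parsedChildCount is an invented accessor used to observe how far parsing has gone.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lazily parsed node. Children are created one at a time from a
// list of pending names, which stands in for the real Parsers.
class SketchObject {
public:
    explicit SketchObject(std::string name,
                          std::vector<std::string> pendingChildren = {})
        : name_(std::move(name)), pending_(std::move(pendingChildren)) {}

    const std::string& name() const { return name_; }
    std::size_t parsedChildCount() const { return children_.size(); }

    // Searches the already parsed children first. If the child is missing and
    // `forceParse` is true, keeps parsing until it appears or nothing is left.
    // Returns nullptr when the child is not found.
    SketchObject* lookUp(const std::string& childName, bool forceParse) {
        for (auto& child : children_) {
            if (child->name() == childName) {
                return child.get();
            }
        }
        while (forceParse && parseNextChild()) {
            if (children_.back()->name() == childName) {
                return children_.back().get();
            }
        }
        return nullptr;
    }

private:
    // Parses one more child; returns false once everything has been parsed.
    bool parseNextChild() {
        assert(next_ <= pending_.size());
        if (next_ == pending_.size()) {
            return false;
        }
        children_.push_back(std::make_unique<SketchObject>(pending_[next_++]));
        return true;
    }

    std::string name_;
    std::vector<std::string> pending_;
    std::size_t next_ = 0;
    std::vector<std::unique_ptr<SketchObject>> children_;
};
```

With pending children "header", "body" and "footer", a lookUp("body", false) on a fresh node returns a null pointer without parsing anything, while lookUp("body", true) parses exactly two children and stops.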

4. Loading modules from HMDL files

To create modules from HMDL files, we must use the ProgramLoader:

ProgramLoader programLoader(static_cast<const HmcModule &>(moduleLoader.getModule("hmc")), compilerDirs, userDir);
moduleLoader.setDirectories(scriptsDirs, programLoader);

Internal: here, each HMDL file found in scriptsDirs is recompiled if it has changed since the last compilation. The compiled files are put in userDir and then parsed entirely (no lazy parsing) using the HmcModule. This outputs one Program per HMDL file, whose underlying structure is an Object (a tree) containing all the relevant information for its file type.

Finally, those programs are stored in the moduleLoader as FromFileModules, with the base name of the HMDL file as the map key (e.g. png.hm gives access to a module at the key "png").

The HMDL language can also be used on the fly to evaluate expressions. This is used, for instance, by the Filter class.
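The key derivation in the example (png.hm giving "png") amounts to stripping the directory and the extension. A minimal sketch follows; moduleKeyFromPath is a hypothetical helper, not a function of the project.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: derives the map key from the path of an HMDL file by
// dropping the directory part and the last extension.
std::string moduleKeyFromPath(const std::string& path) {
    const std::size_t slash = path.find_last_of("/\\");
    const std::string fileName =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    const std::size_t dot = fileName.rfind('.');
    return (dot == std::string::npos) ? fileName : fileName.substr(0, dot);
}
```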

5. BlockExecution, Evaluator and Scopes

Not only does an HMDL file contain the structure of the file, it also ships with a dynamic language that makes it possible to model files whose structure is itself dynamic (through conditional branching, loops, etc.). This is handled by the BlockExecution, Evaluator and Scope classes. These classes are therefore specific to HMDL; they are not used in native modules.

a. BlockExecution

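A BlockExecution represents a succession of instructions, such as declarations, right-value expressions, and control-flow statements (conditional branches, loops, break and continue).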

For instance, each class definition contains a BlockExecution:

class A {
    
    int(8)  b;
}

Here, the class definition contains a BlockExecution that itself contains two declarations.

When a file needs to be parsed, the BlockExecution will be called, and will execute its instructions, until it has parsed enough or encounters a break point (break or continue in a while loop).

Every BlockExecution owns a Scope and an Evaluator.
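The "execute until enough has been parsed or a break point is reached" behaviour can be sketched as a resumable instruction list. Everything here is hypothetical: SketchBlock, SketchStatus and the budget parameter (a stand-in for "parsed enough") are illustrative names, not the BlockExecution interface.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Result of one instruction: keep going, or stop the block (break point).
enum class SketchStatus { Continue, Break };

// Hypothetical resumable block of instructions.
class SketchBlock {
public:
    using Instruction = std::function<SketchStatus()>;

    explicit SketchBlock(std::vector<Instruction> instructions)
        : instructions_(std::move(instructions)) {}

    // Runs at most `budget` further instructions, resuming where the previous
    // call stopped. Returns the number of instructions actually executed.
    std::size_t execute(std::size_t budget) {
        std::size_t executed = 0;
        while (executed < budget && !done()) {
            const SketchStatus status = instructions_[next_++]();
            ++executed;
            if (status == SketchStatus::Break) {
                broken_ = true;
            }
        }
        return executed;
    }

    // True once a break point was hit or every instruction has run.
    bool done() const { return broken_ || next_ >= instructions_.size(); }

private:
    std::vector<Instruction> instructions_;
    std::size_t next_ = 0;
    bool broken_ = false;
};
```

Keeping the position between calls is what lets a caller ask for "one more child" repeatedly instead of running the whole block up front.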

b. Scope

A Scope is basically a way to determine what information (objects, variables) should be accessible in a BlockExecution. In the case of a class, for instance, the Scope is composite: it contains an ObjectScope (to access the children and parent of the Object) and a LocalScope (containing all the variables declared in the class).
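A composite scope of this kind can be sketched as a list of scopes queried in order. The classes below are hypothetical: both the object side and the local side are modelled by the same map-backed SketchMapScope, values are reduced to integers, and the lookup order (object first, then locals) is an assumption made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical scope interface: resolves a name to a value, or nullptr.
class SketchScope {
public:
    virtual ~SketchScope() = default;
    virtual const long long* find(const std::string& name) const = 0;
};

// Map-backed scope, used here for both the object side and the local side.
class SketchMapScope : public SketchScope {
public:
    void set(const std::string& name, long long value) { values_[name] = value; }

    const long long* find(const std::string& name) const override {
        auto it = values_.find(name);
        return it == values_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, long long> values_;
};

// Composite scope: asks each sub-scope in insertion order, first hit wins.
class SketchCompositeScope : public SketchScope {
public:
    void add(const SketchScope* scope) {
        assert(scope != nullptr);
        scopes_.push_back(scope);
    }

    const long long* find(const std::string& name) const override {
        for (const SketchScope* scope : scopes_) {
            if (const long long* value = scope->find(name)) {
                return value;
            }
        }
        return nullptr;
    }

private:
    std::vector<const SketchScope*> scopes_;
};
```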

c. Evaluator

For rightValues instructions, the BlockExecution calls the Evaluator, which is in charge of computing arithmetic operations, assigning values to variables, and calling functions.

Example:

When a BlockExecution encounters:

var bytesize = 8 * bitsize;

it will first call the Evaluator to evaluate 8 * bitsize. The evaluator will try to find a variable or object called bitsize in its scope. It will then apply the multiplication, find the variable bytesize in the scope and assign the result to it.
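The evaluation steps described above (resolve operands in the scope, multiply, assign) can be written out as a short sketch. It is deliberately reduced: the scope is a plain map, only multiplication is supported, and SketchOperand, resolve and evaluateMultiplyAssign are hypothetical names rather than parts of the Evaluator class.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Simplified scope: variable names mapped to integer values.
using SketchVariables = std::map<std::string, long long>;

// An operand is either a literal or a reference to a variable in the scope.
struct SketchOperand {
    bool isVariable;
    long long literal;  // used when isVariable is false
    std::string name;   // used when isVariable is true
};

// Returns the value of an operand, looking variables up in the scope.
long long resolve(const SketchOperand& operand, const SketchVariables& scope) {
    if (!operand.isVariable) {
        return operand.literal;
    }
    auto it = scope.find(operand.name);
    if (it == scope.end()) {
        throw std::runtime_error("undefined variable: " + operand.name);
    }
    return it->second;
}

// Evaluates `target = left * right`: resolves both operands, multiplies them,
// then stores the result under `target` (creating the variable if needed).
long long evaluateMultiplyAssign(SketchVariables& scope, const std::string& target,
                                 const SketchOperand& left,
                                 const SketchOperand& right) {
    const long long result = resolve(left, scope) * resolve(right, scope);
    scope[target] = result;
    return result;
}
```

With bitsize set to 4 in the scope, evaluating the statement from the example stores 32 under bytesize.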

The GUI

The purpose of this part is to display the information to the user in a graphical interface. The main parts are:

1. The main window

The MainWindow is created in main.cpp. It inherits from QMainWindow. Its goal is to display the whole interface, including the menu bar at the top of the window. It therefore handles every action the user triggers from the menu bar, such as opening files or recompiling HMDL files. It also creates and holds the two main widgets of the GUI: the tree widget and the hexadecimal widget.

2. The tree directory

It corresponds to the left part of the interface, which displays the internal structure of the file (parsed or not). The tree widget is common to all the opened files and holds the treeModel instances, which are specific to each opened file.

The treeModel inherits from QAbstractItemModel, which imposes a specific way of managing the data. The fundamental concept is nesting. Every field of the tree is a QModelIndex. A QModelIndex can have children and therefore a parent. The treeModel is like a huge array in which every row represents a specific file through a root QModelIndex; that root contains children, which themselves contain children, and so on. To sum up, the table is a table of tables of tables... To reach a tree object item, you may either rely on user interaction or walk down from the root QModelIndex to the item's QModelIndex.

The tree Object item is the only part of the tree widget directly related to the core. Each item contains a core Object. It inherits from the tree item, which manages the attributes of the item that are useful for its representation in the tree.

Finally, the treeView, which inherits from QTreeView, displays the items and manages the interaction with the user through the selected signal.

3. The hexadecimal directory

It is pretty similar to the treeWidget: the hexFileWidget is shared by all the files, but each of them has its own hexFileModel, which holds the functions that read data from the file. Be careful: the data does not come from the core's file object, but from a QFile that is related to the core's file object only by the fact that they share the same path. Therefore, there are two representations of the file in the whole project: one used by the core and one used by the hexadecimal widget, which is independent from the core (unlike the tree widget: remember that a tree object item contains Objects).

The two widgets may interact: when the user clicks on some data, the corresponding data is underlined in the other widget. There are different actions available depending on the widget such as closing the file in the tree widget or editing the file in the hexadecimal widget.