HexaMonkey - An open-source binary file analyzer

HMDL Documentation

What is HMDL?

HMDL stands for HexaMonkey Description Language. It is an imperative, object-oriented programming language used to describe how files of a given container format should be parsed. HexaMonkey interprets HMDL descriptions in order to parse files.

The basic principle is that container format files are composed of nested boxes. An object model is therefore used to describe the parsing: parsers are written as enhanced class definitions, which keeps them simple and expressive while still offering high-level functionality such as loops, conditional branching and local variables.

Getting started

First of all, create a blank file with a .hm extension and put it in the script subfolder of your HexaMonkey installation folder. The basic layout of the file is then:

  1. Format detection directives
  2. Module importations
  3. Class definitions

For instance a very basic HMDL file would be:

//(1) Format detection directives
addExtension dummy

//(2) Module importation
import bestd

//(3) Class definitions
class DummyFile as File
{
    uint(32) payload[];
}

This description is interpreted as follows:

  1. This tells HexaMonkey that this module should be used if the file has dummy as its extension. (More on format detection)
  2. This loads the bestd module, which stands for big-endian standard (http://en.wikipedia.org/wiki/Endianness). It allows the use of uint. (More on modules)
  3. This declares that the file should be parsed as an array of 32-bit unsigned integers. (More on classes)
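To make the effect of this description concrete, here is a rough Python model of what parsing a dummy file amounts to. It is only a sketch and not HexaMonkey's implementation: the function name parse_dummy_file is hypothetical, and ignoring trailing bytes that do not fill a whole 32-bit word is an assumption.

```python
import struct


def parse_dummy_file(data):
    """Model of `uint(32) payload[];` under bestd (big-endian).

    Reads as many big-endian 32-bit unsigned integers as fit in the data.
    Assumption: leftover bytes that do not form a full word are ignored.
    """
    count = len(data) // 4
    return list(struct.unpack(">%dI" % count, data[:count * 4]))


if __name__ == "__main__":
    # Two 32-bit words: 0x00000001 and 0x000000ff
    print(parse_dummy_file(bytes.fromhex("00000001000000ff")))
```

With the little-endian module lestd, the same model would use "<" instead of ">" in the format string.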

Format detection directives

The format detection directives are standardized predicates applied to files to check whether the module should be loaded to parse them. The standardized operations are the following:

Several directives can be specified for the same file. Overall, the first positive result stands, the priority being magicNumber first, then syncbyte and finally extension.
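The priority rule can be illustrated with a small Python model. This is a sketch under assumptions: the function select_module and the shape of its input are hypothetical, and breaking ties between two matches of the same kind by list order is a guess rather than documented behaviour.

```python
# Directive kinds in decreasing priority, as stated above.
PRIORITY = ("magicNumber", "syncbyte", "extension")


def select_module(matches):
    """Pick the module to load from the directives that matched a file.

    `matches` is a list of (module_name, directive_kind) pairs, one per
    directive whose predicate succeeded. Returns None when nothing matched.
    Assumption: within one kind, the first match in list order wins.
    """
    for kind in PRIORITY:
        for module_name, matched_kind in matches:
            if matched_kind == kind:
                return module_name
    return None


if __name__ == "__main__":
    # A magic number match beats an extension match, whatever the order.
    print(select_module([("ts", "extension"), ("mp4", "magicNumber")]))
```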

Modules

Modules are imported with the command: import moduleName. All the classes defined in the imported module become directly available. If there is a naming conflict, the last class definition is used.

Default Module

This module is always loaded. It contains fundamental classes:

Standard Module

This module contains basic classes such as integers, strings and floats. Two versions of this module exist: a big-endian one named bestd and a little-endian one named lestd (http://en.wikipedia.org/wiki/Endianness).

Variable values

HMDL is weakly typed, and uninitialized variables default to NULL. Internally, a variable can be of one of the following types:

Most operators are implemented and have the same effect as in the C language.

Class Definition

Classes are defined by a unique identifier and a number of parameters as so:

class className(param_1, param_2 ... param_n) 
{
    statements
}

If n is zero, the parentheses are optional.

The body of the class definition is a series of statements enclosed in curly braces, which are interpreted sequentially. The braces are optional if there is at most one statement. The possible statements are:

Object manipulation

Member access

Once members have been declared they can be accessed in different ways:

In case of ambiguity, the most recently defined member is used. Accesses can be chained, using . as a delimiter between identifiers. A valid look-up could be, for instance: array[5].index

Reserved variables

Reserved variables are defined automatically for every object; some are constant and others are modifiable. They all begin with @:

Functions

On top of the standard operators, functions can be used to transform values or generate new ones. The syntax for a function call is the following:

%function_name(value_1, ...,value_n)

Functions are defined in modules and can be imported in the same way as classes. They are defined at the top level of a file as such:

function functionName(param_1, param_2 ... param_n) 
{
    statements
}

The available statements are the same as in a class definition, except for declarations. In addition, a return statement can be used as such:

return value;

If the end of the block is reached without a return statement, the function returns NULL. By default, parameters are passed by reference, which means that the function can modify the value of the argument. The keyword const can be added in front of the parameter name in the function definition to ensure that the caller's value is not modified. In that case, if the function attempts to modify the parameter, the value is copied and the operation is performed on the copy.

Default values for the parameters can be given as such:

param = value

If a function call does not provide enough arguments, the unspecified parameters take their default values. A parameter with no default value is set to NULL.
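The way missing arguments are filled in can be modelled in a few lines of Python. This is an illustrative sketch only: bind_arguments is a hypothetical helper, and Python's None stands in for HMDL's NULL.

```python
NULL = None  # stand-in for HMDL's NULL


def bind_arguments(params, args):
    """Bind call arguments to parameters, filling in what is missing.

    `params` is a list of (name, default) pairs in declaration order,
    where default is NULL for a parameter declared without one.
    """
    bound = {}
    for index, (name, default) in enumerate(params):
        bound[name] = args[index] if index < len(args) else default
    return bound


if __name__ == "__main__":
    # Models a function declared with parameters (a, b = 10).
    params = [("a", NULL), ("b", 10)]
    print(bind_arguments(params, [1]))
    print(bind_arguments(params, []))
```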

Inheritance

HMDL implements two complementary inheritance models: extension and specification.

Extension

A extends B means that B's parser should be prepended to A's parser. This means that defining A as such:

class B
{
    statement_1
    statement_2
    ...
    statement_k
}

class A extends B
{
    statement_k+1
    ...
    statement_n
}

is equivalent to defining A as such:

class A
{
    statement_1
    statement_2
    ...
    statement_n
}
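The same equivalence can be sketched in Python by treating each class as a list of statements with an optional parent. The names flatten and classes are hypothetical, and the model ignores class parameters.

```python
def flatten(class_name, classes):
    """Return the full statement list of a class under extension.

    `classes` maps a class name to (parent_name_or_None, statements).
    The parent's statements are prepended to the child's, recursively.
    """
    parent, statements = classes[class_name]
    inherited = flatten(parent, classes) if parent is not None else []
    return inherited + statements


if __name__ == "__main__":
    classes = {
        "B": (None, ["statement_1", "statement_2"]),
        "A": ("B", ["statement_3"]),
    }
    print(flatten("A", classes))
```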

The syntax for extension is the following:

class className(param_1, param_2 ... param_n) extends parentName(value_1, value_2 ... value_k)
{
    statements
}

The parameter values for the parent type are expressions that can use the parameters of the child class as variables. For instance:

class IntWrapper(size) extends Data(32+size)
{
    String(4) name;
    int(size) payload;
    @value = payload; 
}

Specification

A specifies B means that if the type of the object is B then it should be parsed as A. If A specifies B, then A should extend B, though not necessarily directly. The syntax is:

forward parentName( ... ) to childName( ... )

For convenience, the keyword as can be used in a class definition:

class A(...) extends C(...) as B(...)
{
    statements
}

is equivalent to

class A(...) extends C(...)
{
    statements
}

forward B(...) to A()

and

class A(...) as B(...)
{
    statements
}

is equivalent to

class A(...) extends B( ... )
{
    statements
}
forward B(...) to A()

For instance, the File object can be specified to define the top-level object of the format.

The specification feature relies heavily on the fact that the argument values of the object type can be changed during parsing. A basic format can therefore be implemented as follows:

class DummyFile as File
{
    while(1) Container *;
}
 
class Container(code)
{
    String(4) code;
    int(32) size;
    @args.code = code;
    @size = size;
}

class VideoContainer as Container("vide")
{
    ...
}

class AudioContainer as Container("audi")
{
    ...
}

class IndexContainer as Container("indx")
{
    ...
}

...
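The dispatch performed in this example can be modelled in Python as a lookup on the type name and its argument values. This is a sketch of the idea, not of HexaMonkey's internals: the FORWARDS table and the function resolve_container_type are hypothetical names.

```python
# One entry per `class X as Container("code")` in the example above.
FORWARDS = {
    ("Container", ("vide",)): "VideoContainer",
    ("Container", ("audi",)): "AudioContainer",
    ("Container", ("indx",)): "IndexContainer",
}


def resolve_container_type(data):
    """Decide how a Container should be parsed from its first 4 bytes.

    Reading the code models `String(4) code; @args.code = code;`: once
    the argument is known, the type Container(code) may be specified by
    a more precise class. Unknown codes stay generic Containers.
    """
    code = data[:4].decode("ascii")
    return FORWARDS.get(("Container", (code,)), "Container")


if __name__ == "__main__":
    print(resolve_container_type(b"vide\x00\x00\x00\x10"))
    print(resolve_container_type(b"free\x00\x00\x00\x08"))
```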

Lazy parsing

HexaMonkey relies heavily on lazy parsing, which means that parsing is done only when necessary. This makes it possible to navigate through large files without parsing them completely, which could take several minutes and several gigabytes of memory. Each object is therefore first parsed only until its basic information is known, such as its size, type, value and showcased values.
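The idea can be illustrated with a small Python model based on the Container layout used earlier (a 4-character code followed by a 32-bit size). It is a sketch under stated assumptions, not HexaMonkey's implementation: the names LazyBox and top_level_boxes are hypothetical, and the size is assumed to be big-endian, expressed in bytes, and to include the 8-byte header.

```python
import struct

HEADER_SIZE = 8  # 4-byte code + 32-bit size (assumed layout)


class LazyBox:
    """A box whose header is read eagerly and whose payload is read on demand."""

    def __init__(self, data, offset):
        self._data = data
        self._offset = offset
        self._payload = None
        # Only the basic information is parsed up front: type and size.
        self.code = data[offset:offset + 4].decode("ascii")
        (self.size,) = struct.unpack_from(">I", data, offset + 4)

    @property
    def payload_parsed(self):
        return self._payload is not None

    @property
    def payload(self):
        # The payload is extracted only the first time it is requested.
        if self._payload is None:
            start = self._offset + HEADER_SIZE
            self._payload = self._data[start:self._offset + self.size]
        return self._payload


def top_level_boxes(data):
    """List the top-level boxes, using each size to skip over the payload."""
    boxes = []
    offset = 0
    while offset + HEADER_SIZE <= len(data):
        box = LazyBox(data, offset)
        if box.size < HEADER_SIZE:
            break  # malformed size; stop rather than loop forever
        boxes.append(box)
        offset += box.size
    return boxes


if __name__ == "__main__":
    data = b"vide" + struct.pack(">I", 12) + b"\x01\x02\x03\x04"
    data += b"audi" + struct.pack(">I", 8)
    for box in top_level_boxes(data):
        print(box.code, box.size, box.payload_parsed)
```

Listing the boxes only touches the headers, which is why knowing the size of an object early matters: it is what allows the parser to jump to the next object without reading the current one.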

Knowing the size