HexaMonkey - An open-source binary file analyzer
HMDL Documentation
What is HMDL?
HMDL stands for HexaMonkey Description Language. It is an imperative, object-oriented programming language used to describe how files of a given container format should be parsed. HexaMonkey interprets it in order to parse files.
The basic principle is that files in a container format are composed of nested boxes. An object model is therefore used to describe the parsing. The parsers are written as enhanced class definitions, which makes them simple and expressive while still offering high-level functionalities such as loops, conditional branching and local variables.
Getting started
First of all, create a blank file with a .hm extension and put it in the script subfolder of your HexaMonkey installation folder. The basic layout of the file should then be:
- Format detection directives
- Module importations
- Class definitions
For instance a very basic HMDL file would be:
//(1) Format detection directives
addExtension dummy

//(2) Module importation
import bestd

//(3) Class definitions
class DummyFile as File
{
    uint(32) payload[];
}
This description is interpreted as follows:
- This tells HexaMonkey that if the file has dummy as an extension then this module should be used. (More on format detection)
- This loads the bestd module, which stands for big endian standard (http://en.wikipedia.org/wiki/Endianness). It allows the use of uint. (More on modules)
- This declares that the file should be parsed as an array of 32-bit unsigned integers. (More on classes)
Format detection directives
The format detection directives are standardized predicates applied to files to check whether the module should be loaded to parse them. The standardized operations are the following:
- addMagicNumber magicNumber : This operation relies on magic numbers, a.k.a. file signatures (http://en.wikipedia.org/wiki/Magic_number_(programming)). The magicNumber should be a sequence of bytes that must be present at the beginning of the file. The bytes should be written in hexadecimal and separated by spaces. If the value of a byte can vary, mark it as xx. If a file matches several magic numbers, the longest one is chosen. You can find here a large number of magic numbers.
Example :
addMagicNumber 00 00 00 xx 66 74 79 70 //Magic number for mp4: ....ftyp
- addSyncbyte syncbyte periodicity : This operation searches for a byte repeating with a fixed periodicity. This is useful for formats composed of packets beginning with a fixed synchronisation byte.
Example :
addSyncbyte 0x47 188 //Syncbyte for mpeg2-ts : packets of size 188 bytes
                     //beginning with the syncbyte 0x47
- addExtension extension : This operation associates files that have a given extension with this module.
Example :
addExtension mp4 //Association of the files with a .mp4 extension
Several directives can be specified for the same file. The first positive result stands, the priority being magic number first, then syncbyte and finally extension.
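The detection rules described above can be sketched in Python. This is only an illustration of the documented behaviour, not HexaMonkey's implementation: the function names, the dictionary layout used for modules, the choice to check the sync byte from offset 0, and the number of packets checked are all assumptions of this sketch.

```python
def matches_magic_number(data: bytes, pattern: str) -> bool:
    """Check a space-separated hex pattern against the start of the data.

    The token 'xx' marks a byte whose value may vary.
    """
    tokens = pattern.split()
    if len(data) < len(tokens):
        return False
    return all(
        token.lower() == "xx" or data[i] == int(token, 16)
        for i, token in enumerate(tokens)
    )


def matches_syncbyte(data: bytes, syncbyte: int, periodicity: int, packets: int = 4) -> bool:
    """Check that `syncbyte` repeats every `periodicity` bytes.

    Starting at offset 0 and checking only `packets` packets are assumptions.
    """
    positions = range(0, packets * periodicity, periodicity)
    return len(data) > positions[-1] and all(data[p] == syncbyte for p in positions)


def detect(data: bytes, extension: str, modules: dict):
    """Return the name of the matching module, or None."""
    # 1. Magic numbers have the highest priority; the longest match wins.
    best = None
    for name, rules in modules.items():
        for pattern in rules.get("magic", []):
            if matches_magic_number(data, pattern):
                length = len(pattern.split())
                if best is None or length > best[0]:
                    best = (length, name)
    if best is not None:
        return best[1]
    # 2. Sync bytes come next.
    for name, rules in modules.items():
        for syncbyte, periodicity in rules.get("sync", []):
            if matches_syncbyte(data, syncbyte, periodicity):
                return name
    # 3. Extensions are the last resort.
    for name, rules in modules.items():
        if extension in rules.get("ext", []):
            return name
    return None
```

With a module table holding the mp4 magic number and the mpeg2-ts sync byte from the examples above, a file starting with 00 00 00 18 followed by "ftyp" resolves to the mp4 module even if its extension is unrelated.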
Modules
Modules can be imported very simply by using the command : import moduleName. All the classes available in the imported module will be directly available. If there is a naming conflict, the last class definition will be chosen.
Default Module
This module is always loaded. It contains fundamental classes :
- File : This class will be called to parse the file. It should be specified to define the top-level structure of the format. (more on inheritance).
- Array(type, size) : Objects of type type are parsed in an area of the given size (in bits). size is optional and defaults to the remaining size available in the container object. For convenience
Array(type) array;
is strictly equivalent to
type array[];
- Tuple(type, count) : Objects of type type are parsed count times. For convenience
Tuple(type, 8) tuple;
is strictly equivalent to
type tuple[8];
- Data(size) : Uninterpreted data is parsed in an area of the given size (in bits). size is optional and defaults to the remaining size available in the container object.
Standard Module
This module contains basic classes such as integers, strings, floats... Two versions of this module exist : a big-endian one under the code name bestd and a little-endian one under the code name lestd (http://en.wikipedia.org/wiki/Endianness).
- int(size, base), uint(size, base): An integer of the given size (in bits) is parsed. The size must be between 1 and 64. The value is displayed in the given base, which can be 8, 10 or 16; the default is 10.
- float, double: A floating-point number of 32 bits and 64 bits respectively is parsed. The parsing complies with the IEEE Standard for Floating-Point Arithmetic (IEEE 754).
- String(charCount) : A string of characters is parsed. By default the parsing stops when a null character or the end of the container is reached. A fixed number of characters can be set with the charCount parameter.
- Bitset(bitCount) : A string of bitCount bits is parsed. The size must be between 1 and 64. The object can be manipulated as an unsigned integer.
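To illustrate the difference between bestd and lestd, the following Python sketch decodes the same bytes as an unsigned integer in both byte orders and renders the result in one of the display bases listed above. `read_uint` and `display` are hypothetical helpers written for this example, and only sizes that are a multiple of 8 bits are handled.

```python
def read_uint(data: bytes, size: int, big_endian: bool = True) -> int:
    """Decode an unsigned integer of `size` bits from the start of `data`.

    This sketch only supports byte-aligned sizes (8, 16, ..., 64).
    """
    if not 1 <= size <= 64 or size % 8:
        raise ValueError("this sketch only handles 8, 16, ..., 64 bits")
    raw = data[: size // 8]
    return int.from_bytes(raw, "big" if big_endian else "little")


def display(value: int, base: int = 10) -> str:
    """Render a value in base 8, 10 or 16 (10 is the default)."""
    formats = {8: "o", 10: "d", 16: "x"}
    return format(value, formats[base])


data = bytes([0x00, 0x00, 0x01, 0x00])
as_big = read_uint(data, 32, big_endian=True)       # read like bestd
as_little = read_uint(data, 32, big_endian=False)   # read like lestd
```

The same four bytes therefore yield two different values depending on which version of the standard module is imported.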
Variable values
HMDL is weakly typed; uninitialized variables default to NULL. Internally, a variable can be of one of the following types :
- unsigned integer coded on 64 bits, which allows values from 0 to 18,446,744,073,709,551,615. It can be given in base 2 (0b10000), base 8 (020), base 10 (16) or base 16 (0x10).
- signed integer coded on 64 bits in two's complement, which allows values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. It can be defined in the same way as unsigned integers but also allows negative integers : -0b10000, -020, -16 or -0x10. Positive integers are stored as unsigned integers but are converted to signed integers if they become negative.
- floating-point number with double precision according to the IEEE Standard for Floating-Point Arithmetic (IEEE 754), i.e. 1 bit for the sign, an exponent coded on 11 bits and a mantissa on 52 bits. It can be defined as so : 1.2, 1., .1, -1., 2e-7...
- string coded as a null-terminated string of ASCII characters. It can be defined as so : "example".
- object type defined by a class with its parameters specified. It can be defined as so : className(value_1, value_2...value_k). Be careful: even if k is 0, the parentheses are mandatory in this case.
- unknown, which is limited to the reserved value NULL. It represents the default value on initialisation and the result of an operation with undefined behavior.
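The integer literal forms and value ranges listed above can be checked with a short Python sketch. The helper names are hypothetical and the code only mirrors the rules stated in this section; it says nothing about how HexaMonkey itself reads literals.

```python
UINT64_MAX = 2**64 - 1      # 18,446,744,073,709,551,615
INT64_MIN = -(2**63)        # -9,223,372,036,854,775,808


def parse_integer_literal(text: str) -> int:
    """Parse an integer literal written in base 2, 8, 10 or 16."""
    negative = text.startswith("-")
    digits = (text[1:] if negative else text).lower()
    if digits.startswith("0b"):
        value = int(digits[2:], 2)
    elif digits.startswith("0x"):
        value = int(digits[2:], 16)
    elif len(digits) > 1 and digits.startswith("0"):
        value = int(digits[1:], 8)      # a leading zero marks base 8
    else:
        value = int(digits, 10)
    return -value if negative else value


def storage_kind(value: int) -> str:
    """Tell which internal integer type would hold the value."""
    if 0 <= value <= UINT64_MAX:
        return "unsigned"
    if INT64_MIN <= value < 0:
        return "signed"
    raise OverflowError("value does not fit in 64 bits")
```

All four spellings of sixteen (0b10000, 020, 16 and 0x10) parse to the same value, and subtracting 32 from it moves the result from the unsigned to the signed representation.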
Most operators are implemented and have the same effect as in the C language.
Class Definition
Classes are defined by a unique identifier and a number of parameters as so:
class className(param_1, param_2 ... param_n) { statements }
If n is zero, the parentheses are optional.
The body of the class definition is a series of statements enclosed in curly braces, which are interpreted sequentially. The braces are optional if there is at most one statement. The different possible statements are:
Declaration parses an object with a given object type:
className(value_1, value_2...value_k) name;
Here k can be strictly lower than n, in which case the remaining parameters are simply set to NULL. If k is 0 the parentheses are optional. Giving a name is optional: * can be used instead to create an anonymous object.
The object type can also be defined through a variable, in which case the variable has to be surrounded by parentheses as so:
var i = className(value_1, value_2...value_k);
(i) name;
Local variable declaration declares a variable and initializes it as so:
var i = value;
It is also possible to leave the variable uninitialized, in which case the value will be NULL. The scope of the local variable is the rest of the class definition.
Expression simply evaluates an expression. For instance :
i = 12;
++i;
Conditional branching executes one block or another depending on the value of a condition. It can be defined as such:
if(condition) { then_statements } else { else_statements }
The else is optional and so are the curly braces if there is only one statement. The condition is simply an expression that is implicitly converted to a boolean.
Loop executes a block of statements while the condition is true. It can be defined as such:
while(condition) { statements }
The for notation can also be used:
for(initialisation_statement; condition; end_of_loop_statement) { statements }
≡
initialisation_statement; while(condition) { statements end_of_loop_statement; }
Object manipulation
Member access
Once members have been declared they can be accessed in different ways:
- By name : a member can be accessed by its name, either directly through its identifier as such name, or through an expression evaluating to a string as such ["name"]. Local variables are accessed in the same way.
- By rank : a member can be accessed by its rank, i.e. the ith member of the object is accessed using an expression evaluating to an integer as such [i].
- By object type : a member can be accessed by its object type using an expression evaluating to an object type as such [className(value_1, value_2...value_k)].
In case of ambiguity the most recently defined member is taken. The access can be chained, using . as a delimiter for identifiers. A valid look-up could be for instance:
array[5].index
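The three access modes and the "most recently defined member wins" rule can be modelled with a small Python class. `ParsedObject` and its methods are hypothetical names introduced for this sketch; object types are represented as plain strings, which is a simplification.

```python
class ParsedObject:
    """A toy model of a parsed object and its ordered members."""

    def __init__(self, type_name, value=None):
        self.type_name = type_name
        self.value = value
        self.members = []   # (name, ParsedObject) pairs in declaration order

    def add(self, name, member):
        self.members.append((name, member))
        return member

    def by_name(self, name):
        # Search from the end so the most recently defined member wins.
        for member_name, member in reversed(self.members):
            if member_name == name:
                return member
        return None

    def by_rank(self, rank):
        return self.members[rank][1]

    def by_type(self, type_name):
        for _, member in reversed(self.members):
            if member.type_name == type_name:
                return member
        return None


root = ParsedObject("Example")
root.add("a", ParsedObject("int(32)", 1))
root.add("b", ParsedObject("String(4)", "abcd"))
root.add("a", ParsedObject("int(32)", 2))
```

Here `root.by_name("a")` and `root.by_type("int(32)")` both return the member holding 2, while `root.by_rank(0)` still reaches the first one.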
Reserved variables
Reserved variables are defined automatically for every object; some are constant and others modifiable. They all begin with @ :
- @size (modifiable): Represents the size of the object given in bits. Specifying the size in advance is a very important tool for lazy parsing (more on lazy parsing).
- @args (modifiable): A virtual object representing the parameter values. You can access the ith parameter, of name param_i, either by its rank as so : @args[i], or by its name as so : @args["param_i"] or @args.param_i.
- @value (modifiable): Represents the internal value associated with the object. Standard objects such as int, uint, float, double, String and Bitset have the expected internal values. If an object is evaluated as a variable, its value is used. For instance you can write an integer wrapper as so:
class IntWrapper
{
    String(4) name;
    int(32) payload;
    @value = payload; //Equivalent to : @value = payload.@value;
}
- @info : [...] If @info is NULL then the value displayed will be the standard representation of @value.
- [...] -1 if the object is top-level.
Functions
On top of the standard operators, functions can be used to transform values or generate new ones. The syntax for a function call is the following :
%function_name(value_1, ...,value_n)
The functions are defined in modules and can be imported in the same way as classes. Functions are defined at the top level of the file as such:
function functionName(param_1, param_2 ... param_n) { statements }
The statements available are the same as the ones used in a class definition except for the declaration. In addition a return statement can be used as such:
return value;
If the end of the block is reached with no return statement then the value returned by the function is NULL. By default the parameters are passed by reference, which means that the value of the parameter can be modified. The keyword const can be added in front of the parameter name in the function definition to ensure that the value will not be modified. In this case, if an attempt is made to modify the parameter, the value is copied and the operation is done on the copy.
Default values for the parameters can be given as such :
param = value
If not enough parameters are given to a function call, then the unspecified parameters take their default values. If no default value has been given then the value will be NULL.
Inheritance
HMDL implements two complementary inheritance models : the extension and the specification.
Extension
A extends B means that B's parser should be prepended to A's parser. This means that defining A as such:
class B
{
    statement_1
    statement_2
    ...
    statement_k
}

class A extends B
{
    statement_k+1
    ...
    statement_n
}
is equivalent to defining A as such:
class A
{
    statement_1
    statement_2
    ...
    statement_n
}
The syntax for extension is the following:
class className(param_1, param_2 ... param_n) extends parentName(value_1, value_2 ... value_k) { statements }
The parameter values for the parent object type are expressions that can use the parameter values of the child class as variables. For instance:
class IntWrapper(size) extends Data(32+size)
{
    String(4) name;
    int(size) payload;
    @value = payload;
}
Specification
A specifies B means that if the type of the object is B then it should be parsed as A. If A specifies B then A should extend B, but not necessarily directly. The syntax is :
forward parentName( ... ) to childName( ... )
For convenience, the keyword as can be used in class definition :
class A(...) extends C(...) as B(...) { statements }
≡
class A(...) extends C(...) { statements } forward B(...) to A()
and
class A(...) as B(...) { statements }
≡
class A(...) extends B( ... ) { statements } forward B(...) to A()
For instance, the File object can be specified to define the top-level object of the format.
The specification feature relies heavily on the fact that the argument values of the object type can be changed during parsing. A basic format can therefore be implemented, for instance, as so:
class DummyFile as File
{
    while(1) Container *;
}

class Container(code)
{
    String(4) code;
    int(32) size;
    @args.code = code;
    @size = size;
}

class VideoContainer as Container("vide")
{
    ...
}

class AudioContainer as Container("audi")
{
    ...
}

class IndexContainer as Container("indx")
{
    ...
}
...
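The way forward rules select a more specific class once the argument values are known can be modelled in Python as a lookup table. This is a simplified sketch with hypothetical names (`FORWARDS`, `forward`, `resolve`); it only handles exact argument matches and is not a description of HexaMonkey's internals.

```python
FORWARDS = {}   # (parent name, parent argument values) -> child name


def forward(parent, parent_args, child):
    """Register the rule 'forward parent(parent_args) to child()'."""
    FORWARDS[(parent, tuple(parent_args))] = child


def resolve(type_name, args=()):
    """Follow forward rules until no more specific class is registered."""
    key = (type_name, tuple(args))
    seen = set()
    while key in FORWARDS and key not in seen:
        seen.add(key)                   # guard against circular rules
        key = (FORWARDS[key], ())
    return key[0]


# Rules corresponding to the 'as' clauses of the example above.
forward("File", (), "DummyFile")
forward("Container", ("vide",), "VideoContainer")
forward("Container", ("audi",), "AudioContainer")
forward("Container", ("indx",), "IndexContainer")
```

Before the code member has been parsed, `resolve("Container")` stays on `Container`; once `@args.code` has been set to "vide", `resolve("Container", ("vide",))` gives `VideoContainer`.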
Lazy parsing
HexaMonkey relies heavily on lazy parsing, which means that parsing is done only when necessary. This makes it possible to navigate through large files without having to parse them completely, which could take several minutes and several gigabytes of memory. Each object is therefore first parsed only until its basic information is known, such as its size, type, value and showcased values.
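The idea can be sketched in Python with an object whose size is known up front and whose members are decoded only on first access. The class and function names are hypothetical, the record layout (a 4-character name followed by a 32-bit big-endian integer) is chosen to resemble the IntWrapper example, and offsets are counted in bytes for simplicity.

```python
class LazyRecord:
    """A record whose members are parsed only when first requested."""

    def __init__(self, data: bytes, offset: int, size: int):
        self.data = data
        self.offset = offset      # in bytes
        self.size = size          # known in advance, so siblings can be located
        self._members = None      # nothing parsed yet

    @property
    def parsed(self) -> bool:
        return self._members is not None

    @property
    def members(self) -> dict:
        if self._members is None:           # parse on demand, only once
            start = self.offset
            self._members = {
                "name": self.data[start:start + 4].decode("ascii"),
                "payload": int.from_bytes(
                    self.data[start + 4:start + 8], "big", signed=True
                ),
            }
        return self._members


def split_records(data: bytes, record_size: int = 8):
    """Locate every record without decoding any of them."""
    return [
        LazyRecord(data, offset, record_size)
        for offset in range(0, len(data), record_size)
    ]


data = b"abcd" + (7).to_bytes(4, "big") + b"efgh" + (9).to_bytes(4, "big")
records = split_records(data)
```

Because every record knows its size, the second record can be located and read without ever decoding the first one.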
Knowing the size
- Implicit method for fixed size classes : For simple classes composed only of a fixed number of members with fixed sizes, the size is computed automatically as long as the attribute @size is not modified and the class does not extend another. For instance:
class IntWrapper()
{
    String(4) name;
    int(32) payload;
}
The size of the object will always be 64 (%sizeof(IntWrapper())==64) and the two members will be parsed only when specifically requested.
class Container(code)
{
    String(4) code;
    int(32) size;
    @args.code = code;
    @size = size;
}
Here, however, the size is modified (as well as the type), so the whole object must be parsed in order to know its size.
- Explicit method for size defined on construction
class IntWrapper(size)
{
    String(4) name;
    int(@args.size) payload;
}
In this case the size is variable, but it can be computed on construction. There are two solutions to avoid unnecessary parsing in this case:
It is possible to set the size by directly setting the attribute @size :
class IntWrapper(size)
{
    @size = 32 + @args.size;
    String(4) name;
    int(@args.size) payload;
}
However, here %sizeof(IntWrapper(32))==NULL, which means that containers such as Tuple(IntWrapper(32), 32) won't know their size and therefore won't use lazy parsing. The way to ensure that the %sizeof function gives the expected result is for the class to extend a class that knows its size on construction, such as Data :
class IntWrapper(size) extends Data(size+32)
{
    String(4) name;
    int(@args.size) payload;
}
In this case %sizeof(IntWrapper(32))==64 does hold.
- Explicit method for variable size
When the size is variable, it should simply be set as soon as possible by setting the attribute @size. This is the case when the size is specified by a member of the class :
class Container(code)
{
    String(4) code;
    int(32) size;
    @args.code = code;
    @size = size;
    ...
}
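The reasoning about known and unknown sizes can be summarised with a small Python sketch in which None plays the role of NULL. The helper names are hypothetical, and the sketch only reproduces the arithmetic described in this section (sizes in bits, with String(4) counted as 32 bits).

```python
from typing import Optional, Sequence


def sizeof_fixed(member_sizes: Sequence[Optional[int]]) -> Optional[int]:
    """Sum the member sizes in bits; the result is unknown if any member is."""
    total = 0
    for size in member_sizes:
        if size is None:
            return None
        total += size
    return total


def sizeof_tuple(element_size: Optional[int], count: int) -> Optional[int]:
    """A tuple knows its size only if its element type does."""
    if element_size is None:
        return None
    return element_size * count


# String(4) name; int(32) payload;
int_wrapper_size = sizeof_fixed([32, 32])
```

A tuple of 32 such wrappers then has a known size of 2048 bits, whereas a tuple whose element size is unknown cannot know its own size and has to be parsed.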