Pokedex : « Machoke, the Superpower Pokémon. With enough strength to lift a giant truck with one hand, Machoke are often used for extremely heavy work. »
After some time spent polishing the code, we proudly release Machoke, the CFG-based hash algorithm we introduced during a small talk at r2con in Barcelona in September.
Fuzzy hash and CFG
A fuzzy hash is a hash that can match inputs with homologies, therefore, this kind of hash is very useful in malware classification because it gives us the possibiity to classify malwares families by similarities. Currently, such type of hash includes algorithms like ssdeep, tlsh etc.
Machoke is a fuzzy hash using the control-flow graph of a sample as input for hashing. The control-flow graph is a representation, using graph notation, of all paths that might be traversed by a program during its execution (see below for an example of a CFG of a function as seen by radare2).
- ability to resist minor changes in the samples being classified
- allowing clusterisation (like ssdeep)
This implementation is based on Machoc, originally published by ANSSI during SSTIC2015 as a part of polichombr (https://github.com/ANSSI-FR/polichombr). The original algorithm is the work of @Heurs.
Our implementation is roughly the same, but unlike ANSSI’s Machoc, is implemented using radare2 and r2pipe instead of miasm or IDApython.
In order to be able to compute such hash, this algorithm uses is the following:
- List the functions of the binary
- For a given function:
- Label all the nodes (or code blocks) and calls of the graph
- Record all the connection between the nodes as a string
- Perform the murmurhash3 of the string produced
- Iterate for each functions and concatenate the strings produced
Machoke hash format
The size of the hash produced is directcly linked to the number of functions in the binary. Therefore, it is most likely that the bigger the sample you try to machoke, the bigger the machoke hash produced.
The slides are available here.