nlhbi-malware-extractor : Natural Language Host-Based Indicators Malware Extraction Utility.

Given a path to a PE32 binary executable file, the module creates a MalwareSample object whose fields/properties hold the section names, imported and exported symbols, filenames, URLs, and strings extracted from the binary. The strings field excludes any string containing something already extracted into another field.
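
A rough sketch of what such an object might look like, using only the standard library. The regular expressions, the build_sample() helper, and the field layout are assumptions made for illustration, not the module's actual code. Section names and imported/exported symbols would come from pefile and are left out here.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns; the module's real extraction rules are not shown above.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")  # runs of 4+ printable ASCII bytes
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
FILENAME_RE = re.compile(
    r"\b[\w.-]+\.(?:exe|dll|sys|bat|txt|dat|ini|log)\b", re.IGNORECASE
)


@dataclass
class MalwareSample:
    path: str
    urls: list = field(default_factory=list)
    filenames: list = field(default_factory=list)
    strings: list = field(default_factory=list)


def build_sample(path, data):
    """Fill a MalwareSample from the raw bytes of a file."""
    sample = MalwareSample(path=path)
    for raw in ASCII_RUN.findall(data):
        text = raw.decode("ascii")
        urls = URL_RE.findall(text)
        # Blank out URLs first so their host names are not mistaken for filenames.
        names = FILENAME_RE.findall(URL_RE.sub(" ", text))
        sample.urls.extend(urls)
        sample.filenames.extend(names)
        # Keep only strings that contributed nothing to the other fields.
        if not urls and not names:
            sample.strings.append(text)
    return sample
```

Calling build_sample(path, open(path, "rb").read()) on a real file would populate the three fields shown. The leftover strings field is what the natural language step works on.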

The module creates a list of MalwareSample objects and feeds this list (one object at a time) to the process_malware_sample() function.

Currently, that processing function makes only one additional call, to extract_nl_text(). extract_nl_text() iterates through a MalwareSample object's strings property/field and tokenizes the strings that contain natural language text, using WordNet to determine whether each token is a valid English-language word. It returns a list of tokenized strings (a list of lists). The intent is for these tokenized strings containing natural language text to be used as host-based indicators for malware, perhaps after additional natural language processing or computational linguistic analysis is performed.
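
The filtering idea can be sketched as follows. This is not the module's implementation: a small hard-coded vocabulary (DEMO_VOCAB) stands in for WordNet so the example runs without NLTK, and the min_tokens and min_ratio thresholds are assumptions chosen for illustration.

```python
import re

# Stand-in vocabulary (assumption); the module itself consults WordNet.
DEMO_VOCAB = {
    "this", "program", "cannot", "be", "run", "in", "mode",
    "error", "file", "not", "found",
}

TOKEN_RE = re.compile(r"[A-Za-z]+")


def is_english(token, vocab=DEMO_VOCAB):
    """Return True if the lowercased token is in the vocabulary."""
    return token.lower() in vocab


def extract_nl_text(strings, min_tokens=2, min_ratio=0.6, is_word=is_english):
    """Return tokenized strings (a list of lists) that look like natural language."""
    tokenized = []
    for s in strings:
        tokens = TOKEN_RE.findall(s)
        if len(tokens) < min_tokens:
            continue  # single identifiers such as API names are skipped
        hits = sum(1 for t in tokens if is_word(t))
        # Keep the string if enough of its tokens are recognized words.
        if hits / len(tokens) >= min_ratio:
            tokenized.append(tokens)
    return tokenized
```

With NLTK and its WordNet corpus installed, a WordNet-backed check could be passed in instead, for example is_word=lambda t: bool(wordnet.synsets(t)) after from nltk.corpus import wordnet.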

Libraries Used :
+ pefile (Portable Executable reader module)
+ PEframe

pefile – Portable Executable reader module


pefile gives any Python script access to all (or most) of the contents
of a given PE file.

The structures defined in the Windows header files will be accessible as the
PE instance attributes and will have the same names as defined there.
(The main structures will have the standard capitalized names and will be
attributes of the PE instance. Their members will be attributes.)

Other attributes and data, which require further processing but are very useful,
will be available as lowercase attributes. Among those are the imported and
exported symbols and the sections, with direct access to their data (if any) and
convenient methods to retrieve data based on the address as if the file were
loaded, instead of having to work out the offsets into the file by hand.
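
To illustrate what "based on the address as if the file were loaded" involves, here is a minimal sketch of translating a relative virtual address (RVA) into a file offset using a section table. The Section tuple and rva_to_offset() are hypothetical names for this example only; pefile handles this lookup internally and its real implementation deals with more edge cases.

```python
from collections import namedtuple

# Hypothetical, simplified view of a section header.
Section = namedtuple(
    "Section", "name virtual_address virtual_size raw_offset raw_size"
)


def rva_to_offset(sections, rva):
    """Map an RVA to an offset in the file, or None if no file bytes back it."""
    for s in sections:
        start = s.virtual_address
        end = start + max(s.virtual_size, s.raw_size)
        if start <= rva < end:
            delta = rva - start
            if delta >= s.raw_size:
                return None  # exists only in memory (zero-filled when loaded)
            return s.raw_offset + delta
    return None  # address is not inside any section
```

For example, with a .text section loaded at RVA 0x1000 and stored at file offset 0x400, the RVA 0x1010 maps to file offset 0x410.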


Starting from pefile 1.2 it's possible to write back any changes made to the PE
file. One has to be careful with this functionality, as it is not very
intelligent about reconstructing the PE file. That is, it will not handle
displacing structures if that would be needed because a new section has been added.
The rule of thumb is, if there’s room for an additional header/structure to fit
then there’ll be no problem and pefile will write it.
All other modifications, i.e. changing individual values in header/structure
members should work well.
One possible useful application of this could be to correct malformed headers
used by some malware in order to cause certain analysis tools to malfunction.
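
The reason changes to individual values are safe is that they overwrite a fixed-size field without moving anything else. The sketch below shows that idea with the standard library only; it is not how pefile writes files, and patch_number_of_sections() is a hypothetical helper for this example. It relies on the standard PE layout: the offset of the PE signature is stored at 0x3C, and the COFF header after the signature starts with Machine (2 bytes) followed by NumberOfSections (2 bytes).

```python
import struct


def patch_number_of_sections(data, new_value):
    """Return a copy of data with the COFF NumberOfSections field overwritten."""
    buf = bytearray(data)
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", buf, 0x3C)
    if buf[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # Signature (4 bytes) + Machine (2 bytes) precede NumberOfSections.
    struct.pack_into("<H", buf, e_lfanew + 6, new_value)
    return bytes(buf)
```

The output has the same length as the input and every other byte is untouched, which is why this kind of correction does not require displacing any structure.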

The latest versions are available at:


Just importing it should suffice. The module should be endianness independent and
it’s known to work on OS X, Windows, and Linux.


Some obscure information might not be readily accessible; this may be
due to my ignorance or laziness. Patches or suggestions are, as usual, welcome.

Things known to be missing so far:

- Reading and processing the exceptions directory entry. (Architecture-dependent info)

Download :  | or git clone
Course & Paper :
This code was written for Dr. Sam Liles' CNIT 58100 Cyber Forensics of Malware course and a consequent paper submitted to the DFRWS USA 2015 Annual Conferenc