AVCLASS - A Tool for Massive Malware Labeling.

Labeling a malicious executable as a variant of a known family is important for security applications such as triage and lineage, and for building reference datasets that are in turn used for evaluating malware clustering and training malware classification approaches. Oftentimes, such labeling is based on labels output by antivirus engines. While AV labels are well known to be inconsistent, there is often no other information available for labeling, so security analysts keep relying on them. However, current approaches for extracting family information from AV labels are manual and inaccurate. In this work, we describe AVCLASS, an automatic labeling tool that, given the AV labels for a potentially massive number of samples, outputs the most likely family names for each sample. AVCLASS implements novel automatic techniques to address 3 key challenges: normalization, removal of generic tokens, and alias detection. We have evaluated AVCLASS on 10 datasets comprising 8.9 M samples, larger than any dataset used by malware clustering and classification works. AVCLASS leverages labels from any AV engine, e.g., all 99 AV engines seen in VirusTotal, the largest engine set in the literature. AVCLASS's clustering achieves F1 measures up to 93.9 on labeled datasets and clusters are labeled with fine-grained family names commonly used by the AV vendors. We release AVCLASS to the community.

Why is AVClass useful?
Security researchers often need to extract family information from AV labels, but this process is not as simple as it looks, especially when it must be done for large numbers (e.g., millions) of samples. Some advantages of AVClass are:
1. Automatic. AVClass removes the limitations that manual analysis imposes on the size of the input dataset.
2. Vendor-agnostic. AVClass operates on the labels of any available set of AV engines, which can vary from sample to sample.
3. Cross-platform. AVClass can be used for any platform supported by AV engines, e.g., Windows or Android malware.
4. Does not require executables. AV labels can be obtained from online services like VirusTotal using a sample's hash, even when the executable is not available.
5. Quantified accuracy. We have evaluated AVClass on 5 publicly available malware datasets with ground truth. Details are in the RAID 2016 paper referenced above.
6. Open source. The code is available and we are happy to incorporate suggestions and improvements so that the security community benefits from AVClass.

AVClass malware labeling tool

The main limitation of AVClass is that its output depends on the input AV labels. It tries to compensate for the noise in those labels, but cannot identify the family of a sample if the AV engines do not provide a non-generic family name for that sample. In particular, it cannot label a sample unless at least 2 AV engines agree on a non-generic family name. Results on 8 million samples showed that AVClass could label 81% of the samples. In other words, it could not label 19% of the samples because their labels contained only generic tokens.

The labeler takes as input a JSON file with the AV labels of malware samples (-vt or -lb switches), a file with generic tokens (-gen switch), and a file with aliases (-alias switch). It outputs the most likely family name for each sample. If you do not provide alias or generic tokens files, the default ones in the data folder are used.
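A minimal usage sketch (the avclass_labeler.py script name follows the upstream repository, and the file names are purely illustrative): given a JSON file vt_reports.json with the VirusTotal reports of your samples, you could run

    python avclass_labeler.py -vt vt_reports.json > samples.labels

substituting -lb for -vt if your JSON file is in the other label format the tool accepts. The most likely family name for each sample is written to standard output, here redirected to samples.labels.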

Preparation: Generic Token Detection
The labeler takes as input a file with generic tokens that should be ignored in the AV labels, e.g., trojan, virus, generic, linux. By default, the labeler uses the data/default.generics file of generic tokens. You can edit that file to add additional generic tokens you feel we are missing.
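If you prefer to keep your additions in a separate file, the -gen switch lets you point the labeler at it. A sketch, assuming my.generics follows the same one-token-per-line layout as data/default.generics (file names are illustrative):

    python avclass_labeler.py -lb samples_lb.json -gen my.generics > samples.labels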

Preparation: Alias Detection
Different vendors may assign different names (i.e., aliases) to the same family. For example, some vendors may use zeus and others zbot as aliases for the same malware family. The labeler takes as input a file with aliases that should be merged. By default, the labeler uses the data/default.aliases file of aliases. You can edit that file to add additional aliases you feel we are missing.
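As with the generic tokens, a custom alias file can be supplied with the -alias switch. A sketch, assuming each line of my.aliases maps an alias to the canonical family name it should be replaced with (e.g., zbot to zeus), in the style of data/default.aliases:

    python avclass_labeler.py -lb samples_lb.json -alias my.aliases > samples.labels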

Ground truth evaluation
If you have ground truth for some malware samples, i.e., you know the true family for those samples, you can evaluate the accuracy of the labeling output by AVClass on those samples with respect to that ground truth. The evaluation metrics used are precision, recall, and F1 measure. See our RAID 2016 paper above for their definition.
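A sketch of running such an evaluation in the same pass as the labeling, assuming the -gt and -eval switches and a tab-separated ground truth file of hash/family pairs, as provided by the upstream labeler (neither the switches nor the file layout are described in this text):

    python avclass_labeler.py -lb samples_lb.json -gt ground_truth.tsv -eval > samples.labels

With these switches, the labeler compares the families it outputs against the ground truth families and reports precision, recall, and F1 measure.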

For details, please read the RAID 2016 paper: https://software.imdea.org/%7Ejuanca/papers/avclass_raid16.pdf

Source: https://github.com/malicialab