If you have not read my introduction to this series, I suggest doing so. It gives the goals and plan behind this series as well as a brief introduction to theZoo, the open-source malware repository.
The purpose of this post is to give an overview of theZoo as a whole and some basic analysis.
For this tour, I’ve created a zoo map in order to set expectations for what we will find in theZoo.

At the time of this writing (December 2020), theZoo has 237 distinct directories of malware with ~398 binaries/files. As in the graphic above, these break down into: 284 PE32 Windows executables, 13 MS-DOS executables, 12 Excel or Word documents, 10 DOS (COM) executables, 8 ELF executables, 8 JAR files, 4 FreeBSD executables, and 3 OS X executables. The remainder consists of GIFs, JPGs, and a few other miscellaneous files. (The numbers may not be exact, but they fit our needs right now. Exact numbers would require more thorough analysis. I gathered the numbers using the programs file, wc, and grep.)
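For readers who want to reproduce a rough tally, here is a minimal Python sketch of the same idea. It is not the exact procedure I used: it assumes you have already saved the output of the file command (one "path: description" line per sample), and the CATEGORIES table, the labels, and the sample lines are all made up for illustration. The description text printed by file varies between versions, so the substrings may need adjusting.

```python
from collections import Counter

# Hypothetical substring -> label table. The first match wins, so "PE32"
# is checked before the more general "MS-DOS executable".
CATEGORIES = [
    ("PE32", "PE32 Windows executable"),
    ("MS-DOS executable", "MS-DOS executable"),
    ("COM executable", "DOS (COM) executable"),
    ("ELF", "ELF executable"),
    ("Mach-O", "OS X executable"),
]


def classify(description):
    """Map one `file` description to a coarse label."""
    for needle, label in CATEGORIES:
        if needle in description:
            return label
    return "other"


def tally(file_output_lines):
    """Count labels over lines shaped like '<path>: <description>'."""
    counts = Counter()
    for line in file_output_lines:
        _path, sep, description = line.partition(": ")
        if sep:  # skip lines that do not look like `file` output
            counts[classify(description)] += 1
    return counts


# Illustrative input only; these are not real entries from theZoo.
sample = [
    "a.exe: PE32 executable (GUI) Intel 80386, for MS Windows",
    "b.exe: PE32 executable (console) Intel 80386, for MS Windows",
    "c.bin: ELF 32-bit LSB executable, Intel 80386",
    "d.gif: GIF image data, version 89a",
]
counts = tally(sample)
```

Matching on substrings is crude, which is one reason counts gathered this way should be treated as approximate.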
As that data shows, most of the time spent analyzing theZoo will go to Windows malware.
If you are interested in what is included in theZoo, you can review the directory names on GitHub. If you are interested in contributing an analysis of one of those binaries, see the Contact page and let me know.
Age of Samples/Growing Collection
The numbers I gave were from data gathered in December 2020, and I don’t expect them to change much.
The repository was originally set up so that researchers could upload malware to it. However, uploads have nearly stopped. On average, the malware was uploaded 3.5 years ago (median age of 3 years). The most recent upload was 3 months ago and the oldest was 7 years ago. The samples themselves may be even older.
Overview Analysis
By running automated analysis tools, we can gain an overview of theZoo. Our tools for this analysis are scripts that compare the malware samples to find related binaries, which gives us some idea of what to expect.
Import Table Similarity Graph
We are first going to compare the import tables from the malware. Imports are functions used by a program that are stored elsewhere, such as in a code library or, in the case of Windows, DLL files. For our analysis below, we are only able to evaluate Windows Portable Executables (PE).
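As a rough illustration of what "the import table" means in practice, the sketch below collects imported function names into a Python set. It assumes the third-party pefile library; the function names import_set and imports_from_file and the "dll!function" naming scheme are my own choices for this example, not something taken from theZoo or from the scripts behind the graph.

```python
def import_set(pe):
    """Return a set of 'dll!function' names from a parsed PE object.

    `pe` is expected to look like a pefile.PE instance. pefile only sets
    DIRECTORY_ENTRY_IMPORT when the binary has an import table, so a
    missing attribute is treated as "no imports".
    """
    names = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode(errors="replace").lower()
        for imp in entry.imports:
            if imp.name is not None:
                names.add(f"{dll}!{imp.name.decode(errors='replace')}")
            else:
                # Functions imported by ordinal have no name.
                names.add(f"{dll}!#{imp.ordinal}")
    return names


def imports_from_file(path):
    """Parse a file with pefile and return its import set."""
    import pefile  # third-party: pip install pefile

    # pefile.PE raises pefile.PEFormatError if the file is not a PE.
    return import_set(pefile.PE(path))
```

Lowercasing the DLL name keeps "KERNEL32.dll" and "kernel32.dll" from being counted as different imports when two sets are compared.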

Above is a graph that maps similarity between PE32 binaries. A line is drawn between two samples when the Jaccard Index of their import tables is greater than 80%. (The Jaccard Index is a mathematical way to calculate similarity between two sets: the number of items they share divided by the number of distinct items across both. Think of it as drawing a Venn diagram and measuring how much of the total area is overlap.) The idea is that if the import table of one binary is similar to that of another, there is a good chance that they are related or have closely related functionality.
Though not all the names are readable (since a hash was used as the name when the sample was stored in theZoo), our analysis shows many related samples. For example, the Potao samples form a large cluster just left of center in the graph. There are other Potao samples, like “Potao_Drop” and “Potao_Fake”, which are in separate clusters of 3 or 4 samples. Looking at theZoo, we find that analysts often uploaded these together into the same directory. This analysis primarily confirms that they share a large amount of imported functionality and are likely related.
Some samples may be connected in the graph due to “packing”, a type of obfuscation which hides the list of imports. A packed binary exposes only a few common imports, so two unrelated packed binaries can look alike and produce a false positive when compared in this way.
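To make the comparison concrete, here is a small Python sketch of the Jaccard Index and the "draw a line above 80%" rule. It is a generic illustration, not the code that produced the graph: the sample names and import lists are invented, and similar_pairs is a hypothetical helper.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as "nothing in common"
    return len(a & b) / len(union)


def similar_pairs(features, threshold=0.8):
    """Return (name1, name2, score) for every pair scoring above threshold."""
    edges = []
    for n1, n2 in combinations(sorted(features), 2):
        score = jaccard(features[n1], features[n2])
        if score > threshold:
            edges.append((n1, n2, score))
    return edges


# Invented data: sample_a and sample_b share 5 of 6 distinct imports.
features = {
    "sample_a": {"CreateFileA", "ReadFile", "WriteFile", "CloseHandle", "Sleep"},
    "sample_b": {"CreateFileA", "ReadFile", "WriteFile", "CloseHandle", "Sleep",
                 "ExitProcess"},
    "sample_c": {"socket", "connect", "send"},
}
edges = similar_pairs(features)
```

Here sample_a and sample_b score 5/6 (about 83%), so they get an edge, while sample_c shares nothing with either and stays unconnected. Comparing every pair is quadratic in the number of samples, which is fine for a few hundred binaries but would need a smarter approach for a much larger collection.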
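One possible way to reduce this kind of false positive is to leave out binaries with very few imports before comparing them. The sketch below is only an idea, not something applied to the graph above: drop_sparse is a hypothetical helper, the cutoff of 5 imports is an arbitrary assumption, and the sample data is invented.

```python
def drop_sparse(features, min_imports=5):
    """Keep only samples whose import set has at least `min_imports` entries.

    `min_imports` is an arbitrary cutoff chosen for illustration; a real
    threshold would need tuning against known packed and unpacked samples.
    """
    return {name: imports for name, imports in features.items()
            if len(imports) >= min_imports}


# Invented data: two stubs that expose only loader functions would score
# a perfect match on their import sets despite being unrelated.
features = {
    "packed_1": {"LoadLibraryA", "GetProcAddress"},
    "packed_2": {"LoadLibraryA", "GetProcAddress"},
    "normal": {"CreateFileA", "ReadFile", "WriteFile", "CloseHandle",
               "Sleep", "ExitProcess"},
}
kept = drop_sparse(features)
```

The trade-off is that small but legitimate binaries are dropped along with packed ones, so this filter hides some true relationships as well.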
Strings Similarity Graph
Another possible comparison is to evaluate the human-readable strings contained within the binaries.

This graph was generated by getting the human-readable strings from the executables and using the Jaccard Index to compare them, again drawing a line when the similarity is greater than 80%.
This script again found that the samples of Potao_Fake, Potao_Debu and others were closely related. In comparison to the import table graph, we do not see as many binaries having common strings. Again, the majority of binaries marked as similar to one another were uploaded together by researchers as a malware family; our graph simply offers one more indication that they are related.
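For a sense of what "human-readable strings" means here, the following Python sketch pulls runs of printable ASCII characters out of raw bytes, similar in spirit to the strings utility. It is an approximation rather than the script behind the graph: the function name printable_strings, the minimum length of 5, and the sample bytes are all assumptions made for this example.

```python
import re


def printable_strings(data, min_len=5):
    """Return the set of printable-ASCII runs of at least `min_len` bytes.

    Only plain ASCII (0x20 to 0x7e) is considered; UTF-16 strings, which
    are common in Windows binaries, are not picked up by this sketch.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {match.decode("ascii") for match in re.findall(pattern, data)}


# Invented bytes standing in for the contents of an executable.
# For a real file: data = open(path, "rb").read()
data = b"\x00\x01hello world\x00abc\x00GetProcAddress\xff"
found = printable_strings(data)
```

With this input, "hello world" and "GetProcAddress" are kept and "abc" is discarded as too short. Returning a set means each sample's strings can be fed straight into a set-based similarity measure such as the Jaccard Index.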
Graph Conclusion
So far, I have only generated these graphs with these two metrics. If I find additional metrics I can easily implement, I will. The graphs are roughly what I expected: most of the malware samples are unrelated to one another, while the families that researchers uploaded together are identified as related.
We can reasonably assume some samples to be very similar and, as a result, multiple samples can be included in an individual blog post instead of each receiving its own independent analysis.
Theory
The theory and most of the code for generating these graphs come from Malware Data Science by Joshua Saxe and Hillary Sanders. They use full chapters to discuss the theory behind the graphs, so I will not discuss it in depth here. They also include full chapters on the theory and method of building machine learning programs to identify malware. The book is well worth owning and reading.
Extra
It is worth noting that I have done a few other things behind the scenes. If you are interested in doing your own analysis, I can share some tools. I’ve written some scripts for unzipping all the malware from theZoo and another that attempts to rename binaries so that the names on the graphs are human-readable.