Data in the form of search strings continue to grow as new virus variants are released, which prompts my question - how do AV engines search files for known signatures so efficiently? If I download a new file, my AV scanner rapidly identifies the file as being a threat or not, based on its signatures, but how can it do this so quickly? I'm sure by this point there are hundreds of thousands of signatures.
Antivirus products use signature-based detection in conjunction with a database. When scanning a computer, they'll search for footprints matching those of known malware. These malware footprints are stored in a database. Antivirus products essentially search for the footprints of known malware.
A virus signature (also known as a virus definition) is a file or multiple files that are downloaded by a security program to identify a computer virus. The files enable detection of malware by the antivirus (and other antimalware) software in conventional file scanning and breach detection systems.
Updating. Signature files or definitions are an important part of how antivirus and antimalware software works. These files contain information about different viruses and malware, which is used by the software to detect, clean, and remove detected threats.
UPDATE: As tripleee pointed out, the Aho-Corasick algorithm seems very relevant to virus scanners. Here is some stuff to read:
http://www.dais.unive.it/~calpar/AA07-08/aho-corasick.pdf
http://www.researchgate.net/publication/4276168_Generalized_Aho-Corasick_Algorithm_for_Signature_Based_Anti-Virus_Applications/file/d912f50bd440de76b0.pdf
http://jason.spashett.com/av/index.htm
Aho-Corasick-like algorithm for use in anti-malware code
Below is my old answer. Its still relevant for easily detecting malware like worms which simply make copies of themselves:
I'll just write some of my thoughts on how AVs might work. I don't know for sure. If someone thinks the information is incorrect, please notify me.
There are many ways in which AVs detect possible threats. One way is signature-based detection.
A signature is just a unique fingerprint of a file (which is just a sequence of bytes). In terms of computer science, it can be called a hash. A single hash could take about 4/8/16 bytes. Assuming a size of 4 bytes (for example, CRC32), about 67 million signatures could be stored in 256MB.
All these hashes can be stored in a signature database. This database could be implemented with a balanced tree structure, so that insertion, deletion and search operations can be done in O(logn)
time, which is pretty fast even for large values of n
(n is the number of entries). Or else if a lot of memory is available, a hashtable can be used, which gives O(1)
insertion, deletion and search. This is can be faster as n
grows bigger and a good hashing technique is used.
So what an antivirus does roughly is that it calculates the hash of the file or just its critical sections (where malicious injections are possible), and searches its signature database for it. As explained above, the search is very fast, which enables scanning huge amounts of files in a short amount of time. If it is found, the file is categorized as malicious.
Similarly, the database can be updated quickly since insertion and deletion is fast too.
You could read these pages to get some more insight.
Which is faster, Hash lookup or Binary search?
https://security.stackexchange.com/questions/379/what-are-rainbow-tables-and-how-are-they-used
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With