Programmatical approach in Java for file comparison

What would be the best approach to compare two hexadecimal file signatures against each other for similarities?

More specifically, I would like to take the hexadecimal representation of an .exe file and compare it against a series of virus signatures. My plan is to break the hex representation of the .exe file into groups of N characters (e.g. 10 hex characters) and do the same with each virus signature. The aim is a heuristic check: determine statistically whether the .exe file is X% similar to a known virus signature.

The simplest, and probably very wrong, way I thought of doing this is to compare exe[n, n-1] against virus[n, n-1], where each element in the array is a sub-array, so that exe1[0,9] is compared against virus1[0,9]. Each subset would then be graded statistically.

As you can imagine, this would require a massive number of comparisons and would therefore be very slow. So I wanted to ask whether you can think of a better approach to this kind of comparison, for example one that combines different data structures.
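
As an illustration of the kind of data structure that could cut down the comparisons, here is a minimal sketch, not a finished design. It assumes both inputs are already hex strings, uses overlapping windows of n characters rather than fixed groups (so that inserted junk does not shift every later group), and stores the file's windows in a HashSet so each signature window is checked with one lookup. The class and method names (NGramSimilarity, ngrams, containment) are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

final class NGramSimilarity {

    /** Collects every overlapping n-character window of a hex string into a set. */
    static Set<String> ngrams(String hex, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= hex.length(); i++) {
            grams.add(hex.substring(i, i + n));
        }
        return grams;
    }

    /** Fraction (0.0 to 1.0) of the signature's distinct n-grams that also occur in the file. */
    static double containment(String fileHex, String signatureHex, int n) {
        Set<String> fileGrams = ngrams(fileHex, n);
        Set<String> sigGrams = ngrams(signatureHex, n);
        if (sigGrams.isEmpty()) {
            return 0.0; // signature shorter than n: nothing to compare
        }
        int shared = 0;
        for (String gram : sigGrams) {
            if (fileGrams.contains(gram)) {
                shared++;
            }
        }
        return (double) shared / sigGrams.size();
    }

    public static void main(String[] args) {
        // Hypothetical inputs: the signature appears verbatim inside the file.
        String fileHex = "4d5a90000300000004000000ffff0000";
        String signatureHex = "0300000004000000";
        System.out.println(containment(fileHex, signatureHex, 4)); // prints 1.0
    }
}
```

The returned fraction could serve as the "X% similarity" figure. The file's set is built once and can be reused for every signature, so the cost per signature is proportional to the signature length rather than to the file length.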

This is for a project I am doing for my BSc, in which I am trying to develop an algorithm to detect polymorphic malware. This is only one part of the whole system; the other part is based on genetic algorithms that evolve the static virus signature. Any advice, comments, or general information such as resources is very welcome.

Definition: Polymorphic malware (viruses, worms, ...) keeps the same functionality and payload as its "original" version while taking on apparently different structures (variants). It achieves this through code obfuscation, which alters its hex signature. Some of the techniques used for polymorphism are: format alteration (inserting or removing blanks), variable renaming, statement rearrangement, junk code addition, statement replacement (x=1 changes to x=y/5 where y=5), and swapping of control statements. Much like the flu virus mutates so that vaccination becomes ineffective, polymorphic malware mutates to avoid detection.

Update: After following the advice you gave me on what to read, I ended up somewhat more confused. I found several algorithms that could apply to my problem, such as:

Longest common subsequence
Levenshtein algorithm
Needleman–Wunsch algorithm
Smith–Waterman algorithm
Boyer Moore algorithm
Aho Corasick algorithm

But now I don't know which to use; they all seem to do the same thing in different ways. I will continue my research so that I can understand each one better, but in the meantime could you give me your opinion on which might be more suitable, so that I can prioritise it and study it in more depth?

Update 2: I ended up using an amalgamation of the LCSubsequence, LCSubstring and Levenshtein Distance. Thank you all for the suggestions.
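
For readers comparing the three measures named in Update 2, the following is a minimal sketch of the textbook dynamic-programming versions of each. It is not the implementation from the finished paper, and the class and method names (SequenceMeasures, levenshtein, lcSubsequence, lcSubstring) are invented for this example. All three run in O(|a| * |b|) time, so in practice they would be applied to signatures and short file fragments rather than whole executables.

```java
final class SequenceMeasures {

    /** Levenshtein distance: minimum number of single-character insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Length of the longest common subsequence (characters in order, gaps allowed). */
    static int lcSubsequence(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Length of the longest common substring (one contiguous run shared by both). */
    static int lcSubstring(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1; // extend the run ending at (i-1, j-1)
                    best = Math.max(best, dp[i][j]);
                }
            }
        }
        return best;
    }
}
```

Roughly speaking, the longest common substring rewards an untouched run of bytes, the longest common subsequence tolerates junk inserted between matching bytes, and Levenshtein distance also counts substituted bytes.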

There is a copy of the finished paper on GitHub

asked Nov 01 '10 10:11

Carlos

2 Answers

For algorithms like these I suggest you look into the bioinformatics area. There is a similar problem setting there in that you have large files (genome sequences) in which you are looking for certain signatures (genes, special well-known short base sequences, etc.).

Also, when it comes to polymorphic malware, this field should have a lot to offer, because in biology it seems similarly difficult to get exact matches. (Unfortunately, I am not aware of appropriate approximate searching/matching algorithms to point you to.)

One example from this direction would be to adapt something like the Aho Corasick algorithm in order to search for several malware signatures at the same time.
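
To make the idea concrete, here is a compact sketch of an Aho Corasick automaton that counts exact occurrences of several signatures in one pass over the text. It assumes the file and the signatures are hex strings; the class and method names (AhoCorasick, countMatches) and the sample signatures are made up for illustration, and a real scanner would more likely work on raw bytes so that matches cannot start in the middle of a byte.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class AhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // trie edges per node
    private final List<Integer> fail = new ArrayList<>();                 // failure link per node
    private final List<List<Integer>> out = new ArrayList<>();            // ids of patterns ending at node
    private final int patternCount;

    AhoCorasick(List<String> patterns) {
        patternCount = patterns.size();
        newNode(); // node 0 is the root
        for (int id = 0; id < patterns.size(); id++) {
            int node = 0;
            for (char c : patterns.get(id).toCharArray()) {
                Integer child = next.get(node).get(c);
                if (child == null) {
                    child = newNode();
                    next.get(node).put(c, child);
                }
                node = child;
            }
            out.get(node).add(id);
        }
        buildFailureLinks();
    }

    private int newNode() {
        next.add(new HashMap<>());
        fail.add(0);
        out.add(new ArrayList<>());
        return next.size() - 1;
    }

    /** Breadth-first pass: each node's failure link points to its longest proper suffix in the trie. */
    private void buildFailureLinks() {
        Queue<Integer> queue = new ArrayDeque<>(next.get(0).values()); // depth-1 nodes fail to the root
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (Map.Entry<Character, Integer> edge : next.get(node).entrySet()) {
                char c = edge.getKey();
                int child = edge.getValue();
                int f = fail.get(node);
                while (f != 0 && !next.get(f).containsKey(c)) {
                    f = fail.get(f);
                }
                Integer target = next.get(f).get(c);
                fail.set(child, target == null ? 0 : target);
                out.get(child).addAll(out.get(fail.get(child))); // inherit matches of the suffix
                queue.add(child);
            }
        }
    }

    /** Returns how often each pattern occurs in text, indexed by the pattern's position in the input list. */
    int[] countMatches(String text) {
        int[] counts = new int[patternCount];
        int node = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != 0 && !next.get(node).containsKey(c)) {
                node = fail.get(node);
            }
            node = next.get(node).getOrDefault(c, 0);
            for (int id : out.get(node)) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical signatures and file content, for illustration only.
        AhoCorasick scanner = new AhoCorasick(Arrays.asList("deadbeef", "cafebabe"));
        int[] counts = scanner.countMatches("00deadbeef00cafebabedeadbeef");
        System.out.println(Arrays.toString(counts)); // prints [2, 1]
    }
}
```

The automaton is built once from all signatures, after which every file is scanned in a single pass regardless of how many signatures there are. It only finds exact matches, so for polymorphic variants it would have to be combined with an approximate measure.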

Similarly, algorithms like the Boyer Moore algorithm give you fantastic search runtimes especially for longer sequences (average case of O(N/M) for a text of size N in which you look for a pattern of size M, i.e. sublinear search times).
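
As a rough sketch of where the sublinear behaviour comes from, the code below implements the Horspool simplification of Boyer Moore, which uses only the bad-character shift table and omits the good-suffix rule of the full algorithm. It assumes the file and the signature are available as byte arrays; the class and method names (HorspoolSearch, indexOf) are invented for this example.

```java
import java.util.Arrays;

final class HorspoolSearch {

    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(byte[] text, byte[] pattern) {
        int n = text.length;
        int m = pattern.length;
        if (m == 0) {
            return 0;
        }
        // shift[b]: how far the window may slide when its last byte is b.
        int[] shift = new int[256];
        Arrays.fill(shift, m); // bytes absent from the pattern allow a full-length jump
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--; // compare right to left
            }
            if (j < 0) {
                return pos;
            }
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

Whenever the last byte of the current window does not occur in the pattern, the window jumps forward by the whole pattern length, so many text bytes are never inspected; the longer the pattern, the larger those jumps.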

answered Oct 05 '22 05:10

Frank

A number of papers have been published on finding near duplicate documents in a large corpus of documents in the context of websearch. I think you will find them useful. For example, see this presentation.

answered Oct 05 '22 05:10

Amit Prakash

Tags: java, algorithm, data-structures, distance, file-comparison