Matching many files against many patterns in Java

Here is what I want to do: on the one side, I have a text file with ~100,000 string patterns (one per line), most of them about 40-200 characters long. On the other side, I have ~130,000 files, ranging from just a few kilobytes up to several hundred megabytes (however, 95% of the files are only a few hundred kB).

Now, I want to match every one of the 130k files against all of the 100k patterns.

Right now I am doing the matching with the .contains() method; here is some example code:

String file = readFile("somefile.pdf"); // see benchmark below
String[] patterns = readFile("patterns.txt").split("\n"); // read 100k patterns into an array
for (int i = 0; i < patterns.length; i++) {
    if (file.contains(patterns[i])) {
        // pattern matched
    } else {
        // pattern not matched
    }
}

I am running this on a rather powerful desktop system (4-core 2.9 GHz, 4 GB memory, SSD) and I get very poor performance:

When somefile.pdf is a 1.2 MB file, a match against all 100k patterns takes ~43 seconds. A 400 kB file takes ~14 seconds. A 50 kB file takes ~2 seconds.

This is way too slow; I need roughly 40x-50x the performance. What can I do?

asked Jan 20 '26 by user3004200

1 Answer

Creating a search index over these 130k files would probably be the most sustainable approach.

A similar question was answered over here: Searching for matches in 3 million text files

Libraries / Tools that are typically used in Java environments:

  • Lucene
  • Solr
  • elasticsearch
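
To make the idea concrete, here is a minimal sketch of the Lucene route: index the files once, then run each of the 100k patterns as a phrase query against the index. It assumes the Apache Lucene core and queryparser modules (Lucene 5+ style API); listOfAllFiles() and readFile() are hypothetical placeholders standing in for your own file enumeration and text extraction (for PDFs you would extract text first, e.g. with a parser of your choice).

// Sketch only, not a drop-in solution: assumes Apache Lucene (core + queryparser),
// one indexed "content" field per file, and your own I/O helpers.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexAndSearch {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));

        // 1) Index every file once (the expensive part, done a single time).
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String path : listOfAllFiles()) {            // hypothetical helper
                Document doc = new Document();
                doc.add(new StringField("path", path, Field.Store.YES));
                doc.add(new TextField("content", readFile(path), Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        // 2) Run each of the 100k patterns as a phrase query against the index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("content", analyzer);
            for (String pattern : readFile("patterns.txt").split("\n")) {
                // Quote + escape so the pattern is matched as a phrase,
                // not interpreted as query syntax.
                ScoreDoc[] hits = searcher
                        .search(parser.parse("\"" + QueryParser.escape(pattern) + "\""), 10)
                        .scoreDocs;
                // hits now lists the files containing this pattern
            }
        }
    }

    // Placeholders standing in for the question's own I/O helpers.
    private static Iterable<String> listOfAllFiles() { return java.util.Collections.emptyList(); }
    private static String readFile(String path) { return ""; }
}

With the index built once, each pattern lookup only touches the documents that actually contain the pattern's terms instead of scanning every file, which is where the speedup over repeated contains() calls comes from.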
answered Jan 22 '26 by reto