I have been tasked with reading large CSV files (300k+ records) and applying regexp patterns to each record. I have always been a PHP developer and never really tried any other languages, but decided I should take the plunge and attempt this in Java, which I assumed would be much faster.
In fact, just reading the CSV file line by line was 3x faster in Java. However, once I applied the regexp requirements, the Java implementation took 10-20% longer than the PHP script.
It is entirely possible that I did something wrong in Java, because I only learned it as I went today. Below are the two scripts; any advice would be greatly appreciated. I would really rather not give up on Java for this particular project.
PHP CODE
<?php
$bgtime = time();
$patterns = array(
    "/SOME REGEXP/",
    "/SOME REGEXP/",
    "/SOME REGEXP/",
    "/SOME REGEXP/"
);
$fh = fopen('largeCSV.txt', 'r');
while ($currentLineString = fgetcsv($fh, 10000, ","))
{
    foreach ($patterns as $pattern)
    {
        preg_match_all($pattern, $currentLineString[6], $matches);
    }
}
fclose($fh);
print "Execution Time: " . (time() - $bgtime);
?>
JAVA CODE
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;

public class testParser
{
    public static void main(String[] args)
    {
        long start = System.currentTimeMillis();

        String[] rawPatterns = {
            "SOME REGEXP",
            "SOME REGEXP",
            "SOME REGEXP",
            "SOME REGEXP"
        };

        // Compile the patterns once, up front, so they are reused for every line.
        ArrayList<Pattern> compiledPatternList = new ArrayList<Pattern>();
        for (String patternString : rawPatterns)
        {
            compiledPatternList.add(Pattern.compile(patternString));
        }

        try
        {
            String fileName = "largeCSV.txt";
            CSVReader reader = new CSVReader(new FileReader(fileName));
            String[] header = reader.readNext();
            String[] nextLine;
            String description;
            while ((nextLine = reader.readNext()) != null)
            {
                description = nextLine[6];
                for (Pattern compiledPattern : compiledPatternList)
                {
                    Matcher m = compiledPattern.matcher(description);
                    while (m.find())
                    {
                        //System.out.println(m.group(0));
                    }
                }
            }
            reader.close();
        }
        catch (IOException ioe)
        {
            System.out.println("Blah!");
        }

        long end = System.currentTimeMillis();
        System.out.println("Execution time was " + ((end - start) / 1000) + " seconds.");
    }
}
It takes about 40 microseconds per string. Needless to say, once the number of strings reaches a few thousand, that becomes far too slow.
The reason the regex is so slow is that the "*" quantifier is greedy by default, so the first ".*" tries to match the whole string and then begins to backtrack character by character. The runtime is exponential in the number of numbers on a line.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters the engine has to scan to decide a match, the faster your regexes will be; see the sketch below.
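As a rough illustration (a minimal, self-contained sketch with hypothetical patterns and input, not the ones from your file): the broad ".*" pattern scans to the end of the field before it can satisfy the trailing comma, while the "[^,]*" version stops at the first comma.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSpecificityDemo
{
    public static void main(String[] args)
    {
        // Build a sample "description" field with many comma-separated numbers.
        StringBuilder sb = new StringBuilder("ID:42,");
        for (int i = 0; i < 10000; i++)
        {
            sb.append(i).append(',');
        }
        String description = sb.toString();

        // Greedy ".*" runs to the end of the string first, then backtracks
        // until the trailing "," matches; "[^,]*" cannot cross a comma at all.
        Pattern broad = Pattern.compile("ID:.*,");
        Pattern specific = Pattern.compile("ID:[^,]*,");

        for (Pattern p : new Pattern[] { broad, specific })
        {
            Matcher m = p.matcher(description);
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++)
            {
                m.reset();
                m.find();
            }
            long end = System.nanoTime();
            System.out.println(p.pattern() + " : " + ((end - start) / 1000) + " microseconds for 1000 matches");
        }
    }
}

On a field with a few thousand comma-separated values, the second pattern only ever examines a handful of characters per match, while the first walks the entire field every time.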
Using a buffered reader might improve performance quite a bit:
CSVReader reader = new CSVReader(new BufferedReader(new FileReader(fileName)));
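(If you try this, you will also need import java.io.BufferedReader; alongside the existing java.io imports.)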
I don't see anything glaringly wrong with your code. Try isolating the performance bottleneck using a profiler. I find the NetBeans profiler very user-friendly.
EDIT: Why speculate? Profile the app and get a detailed report of where the time is spent. Then work to resolve the inefficient areas. See http://profiler.netbeans.org/ for more information.
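If you want a quick split before reaching for a profiler, you can also time the two phases by hand. A rough sketch, assuming the same largeCSV.txt, column index, and opencsv CSVReader as in the question (the single pattern is just a placeholder for your four):

import au.com.bytecode.opencsv.CSVReader;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitTiming
{
    public static void main(String[] args) throws IOException
    {
        // Placeholder pattern; substitute the real patterns from the question.
        Pattern pattern = Pattern.compile("SOME REGEXP");

        long readNanos = 0;   // time spent reading and parsing CSV lines
        long matchNanos = 0;  // time spent running the regex

        CSVReader reader = new CSVReader(new BufferedReader(new FileReader("largeCSV.txt")));
        String[] nextLine;
        while (true)
        {
            long t0 = System.nanoTime();
            nextLine = reader.readNext();
            readNanos += System.nanoTime() - t0;
            if (nextLine == null)
            {
                break;
            }

            long t1 = System.nanoTime();
            Matcher m = pattern.matcher(nextLine[6]);
            while (m.find())
            {
                // consume matches
            }
            matchNanos += System.nanoTime() - t1;
        }
        reader.close();

        System.out.println("Reading:  " + (readNanos / 1000000) + " ms");
        System.out.println("Matching: " + (matchNanos / 1000000) + " ms");
    }
}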
EDIT2: OK, I got bored and profiled this. My code is identical to yours and parsed a CSV file with 1,000 identical lines as follows:
SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP
Here are the results (obviously your results will differ as my regular expressions are trivial). However, it's plain to see that the regex processing is not your main area of concern.
Interestingly, if I wrap the FileReader in a BufferedReader, the performance improves by a whopping 18%.