
Slower than expected Java regex performance

Tags: java, regex, php

I have been tasked with reading large CSV files (300k+ records) and applying regexp patterns to each record. I have always been a PHP developer and have never really tried other languages, but I decided to take the plunge and attempt this in Java, which I assumed would be much faster.

In fact, just reading the CSV file line by line was 3x faster in Java. However, once I applied the regexp requirements, the Java implementation took 10-20% longer than the PHP script.

It is quite possible that I did something wrong in Java, since I only learned it as I went today. Both scripts are below; any advice would be greatly appreciated. I would really like not to give up on Java for this particular project.

PHP CODE

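<?php
$bgtime = time();

// One entry per regular expression to run against each record.
$patterns = array(
    "/SOME REGEXP/",
    "/SOME REGEXP/",
    "/SOME REGEXP/",
    "/SOME REGEXP/"
);

$fh = fopen('largeCSV.txt', 'r');
// fgetcsv() returns false at end of file, which ends the loop.
while ($currentLineString = fgetcsv($fh, 10000, ","))
{
    foreach ($patterns as $pattern)
    {
        // Apply every pattern to the seventh field (index 6).
        preg_match_all($pattern, $currentLineString[6], $matches);
    }
}
fclose($fh);
print "Execution Time: " . (time() - $bgtime);

?>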

JAVA CODE

import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class testParser
{
    public static void main(String[] args)
    {
        long start = System.currentTimeMillis();

        String[] rawPatterns = {
            "SOME REGEXP",
            "SOME REGEXP",
            "SOME REGEXP",
            "SOME REGEXP"
        };

        // Compile each pattern once, outside the read loop.
        List<Pattern> compiledPatternList = new ArrayList<Pattern>();
        for(String patternString : rawPatterns)
        {
            compiledPatternList.add(Pattern.compile(patternString));
        }

        String fileName = "largeCSV.txt";
        // try-with-resources closes the reader even if an exception is thrown.
        try(CSVReader reader = new CSVReader(new FileReader(fileName)))
        {
            reader.readNext(); // skip the header row
            String[] nextLine;

            while((nextLine = reader.readNext()) != null)
            {
                String description = nextLine[6];
                for(Pattern compiledPattern : compiledPatternList)
                {
                    Matcher m = compiledPattern.matcher(description);
                    while(m.find())
                    {
                        //System.out.println(m.group(0));
                    }
                }
            }
        }
        catch(IOException ioe)
        {
            System.err.println("Failed to read " + fileName + ": " + ioe.getMessage());
        }

        long end = System.currentTimeMillis();

        // Divide by 1000.0 so fractions of a second are not truncated.
        System.out.println("Execution time was "+((end-start)/1000.0)+" seconds.");
    }
}
IOInterrupt asked Jul 11 '11 20:07



2 Answers

Wrapping the FileReader in a BufferedReader may improve performance noticeably:

CSVReader reader = new CSVReader(new BufferedReader(new FileReader(fileName)));
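
To illustrate the idea without the opencsv dependency, here is a minimal sketch that wraps an arbitrary Reader in a BufferedReader and counts regex matches line by line. The class name BufferedScan, the method countMatches and the sample input are made up for illustration; in the real program the StringReader would be replaced by new FileReader(fileName). Whether buffering helps in your case depends on whether the CSV library already buffers internally, so measure before and after.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the question's code.
class BufferedScan {
    // Counts every match of the pattern across all lines from the source.
    static int countMatches(Reader source, Pattern pattern) throws IOException {
        int total = 0;
        // BufferedReader fetches large chunks from the underlying reader,
        // so most readLine() calls are served from memory.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = pattern.matcher(line);
                while (m.find()) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for new FileReader("largeCSV.txt").
        Reader source = new StringReader("a1,b22\nc333\n");
        System.out.println(countMatches(source, Pattern.compile("\\d+"))); // prints 3
    }
}
```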
rsp answered Oct 14 '22 20:10


I don't see anything glaringly wrong with your code. Try isolating the performance bottleneck using a profiler. I find the NetBeans profiler very user-friendly.

EDIT: Why speculate? Profile the app and get a detailed report of where the time is spent. Then work to resolve the inefficient areas. See http://profiler.netbeans.org/ for more information.
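
If setting up a profiler is not an option right away, a rough first cut is to time the read phase and the regex phase separately. The sketch below is not a substitute for a profiler and is not taken from the question; PhaseTimer, readAll and matchAll are hypothetical names, and the in-memory sample stands in for the real file. A single pass like this includes JIT warm-up, so treat the numbers as indicative only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for coarse phase timing.
class PhaseTimer {
    // Phase 1: read every line into memory so I/O is measured on its own.
    static List<String> readAll(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Phase 2: run every precompiled pattern over every line, counting matches.
    static int matchAll(List<String> lines, List<Pattern> patterns) {
        int total = 0;
        for (String line : lines) {
            for (Pattern p : patterns) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        List<Pattern> patterns = List.of(Pattern.compile("\\d+"), Pattern.compile("[a-z]+"));

        long t0 = System.nanoTime();
        // Stand-in for new FileReader("largeCSV.txt").
        List<String> lines = readAll(new StringReader("ab12,cd\n34ef\n"));
        long t1 = System.nanoTime();
        int matches = matchAll(lines, patterns);
        long t2 = System.nanoTime();

        System.out.printf("read:  %.3f ms%n", (t1 - t0) / 1e6);
        System.out.printf("regex: %.3f ms (%d matches)%n", (t2 - t1) / 1e6, matches);
    }
}
```

If the read phase dominates, look at the I/O and CSV parsing first; if the regex phase dominates, look at the patterns themselves.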

EDIT2: OK, I got bored and profiled this. My code is identical to yours, and I parsed a CSV file with 1,000 identical lines, each as follows:

SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP

Here are the results. Yours will obviously differ, since my regular expressions are trivial, but it is plain to see that regex processing is not your main area of concern.

[Image: profiler results]

Interestingly, if I apply a BufferedReader, performance improves by a whopping 18% (see below).

[Image: profiler results with BufferedReader]

hoipolloi answered Oct 14 '22 20:10