I have been tasked with reading large CSV files (300k+ records) and applying regexp patterns to each record. I have always been a PHP developer and never really tried any other languages, but decided I should take the plunge and attempt this in Java, which I assumed would be much faster.
In fact, just reading the CSV file line by line was 3x faster in Java. However, once I applied the regexp requirements, the Java implementation took 10-20% longer than the PHP script.
It is entirely possible that I did something wrong in Java, because I only learned it as I went today. Below are the two scripts; any advice would be greatly appreciated. I would really rather not give up on Java for this particular project.
PHP CODE
<?php
$bgtime = time();
$patterns = array(
    "/SOME REGEXP/",
    "/SOME REGEXP/",
    "/SOME REGEXP/",
    "/SOME REGEXP/"
);
$fh = fopen('largeCSV.txt', 'r');
while ($currentLineString = fgetcsv($fh, 10000, ","))
{
    foreach ($patterns as $pattern)
    {
        preg_match_all($pattern, $currentLineString[6], $matches);
    }
}
fclose($fh);
print "Execution Time: " . (time() - $bgtime);
?>
JAVA CODE
import au.com.bytecode.opencsv.CSVReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;

public class testParser
{
    public static void main(String[] args)
    {
        long start = System.currentTimeMillis();

        String[] rawPatterns = {
            "SOME REGEXP",
            "SOME REGEXP",
            "SOME REGEXP",
            "SOME REGEXP"
        };

        // Compile the patterns once, up front, so they are reused for every line.
        ArrayList<Pattern> compiledPatternList = new ArrayList<Pattern>();
        for (String patternString : rawPatterns)
        {
            compiledPatternList.add(Pattern.compile(patternString));
        }

        try
        {
            String fileName = "largeCSV.txt";
            CSVReader reader = new CSVReader(new FileReader(fileName));
            String[] header = reader.readNext();
            String[] nextLine;
            String description;
            while ((nextLine = reader.readNext()) != null)
            {
                description = nextLine[6];
                for (Pattern compiledPattern : compiledPatternList)
                {
                    Matcher m = compiledPattern.matcher(description);
                    while (m.find())
                    {
                        //System.out.println(m.group(0));
                    }
                }
            }
            reader.close();
        }
        catch (IOException ioe)
        {
            System.out.println("Blah!");
        }

        long end = System.currentTimeMillis();
        System.out.println("Execution time was " + ((end - start) / 1000) + " seconds.");
    }
}
It takes about 40 microseconds per string. Needless to say, once the number of strings reaches a few thousand, that becomes far too slow.
The reason the regex is so slow is that the "*" quantifier is greedy by default, so the first ".*" tries to match the whole string and then begins to backtrack character by character. The runtime is exponential in the number of numbers on a line.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters the engine has to scan to decide a match, the faster your regexes will be; see the sketch below.
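As a rough illustration (a minimal, self-contained sketch with hypothetical patterns and input, not the ones from your file): the broad ".*" pattern scans to the end of the field before it can satisfy the trailing comma, while the "[^,]*" version stops at the first comma.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSpecificityDemo
{
    public static void main(String[] args)
    {
        // Build a sample "description" field with many comma-separated numbers.
        StringBuilder sb = new StringBuilder("ID:42,");
        for (int i = 0; i < 10000; i++)
        {
            sb.append(i).append(',');
        }
        String description = sb.toString();

        // Greedy ".*" runs to the end of the string first, then backtracks
        // until the trailing "," matches; "[^,]*" cannot cross a comma at all.
        Pattern broad = Pattern.compile("ID:.*,");
        Pattern specific = Pattern.compile("ID:[^,]*,");

        for (Pattern p : new Pattern[] { broad, specific })
        {
            Matcher m = p.matcher(description);
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++)
            {
                m.reset();
                m.find();
            }
            long end = System.nanoTime();
            System.out.println(p.pattern() + " : " + ((end - start) / 1000) + " microseconds for 1000 matches");
        }
    }
}

On a field with a few thousand comma-separated values, the second pattern only ever examines a handful of characters per match, while the first walks the entire field every time.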
Using a buffered reader might improve performance quite a bit:
CSVReader reader = new CSVReader(new BufferedReader(new FileReader(fileName)));
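(If you try this, you will also need import java.io.BufferedReader; alongside the existing java.io imports.)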
I don't see anything glaringly wrong with your code. Try isolating the performance bottleneck using a profiler. I find the NetBeans profiler very user-friendly.
EDIT: Why speculate? Profile the app and get a detailed report of where the time is spent. Then work to resolve the inefficient areas. See http://profiler.netbeans.org/ for more information.
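If you want a quick split before reaching for a profiler, you can also time the two phases by hand. A rough sketch, assuming the same largeCSV.txt, column index, and opencsv CSVReader as in the question (the single pattern is just a placeholder for your four):

import au.com.bytecode.opencsv.CSVReader;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitTiming
{
    public static void main(String[] args) throws IOException
    {
        // Placeholder pattern; substitute the real patterns from the question.
        Pattern pattern = Pattern.compile("SOME REGEXP");

        long readNanos = 0;   // time spent reading and parsing CSV lines
        long matchNanos = 0;  // time spent running the regex

        CSVReader reader = new CSVReader(new BufferedReader(new FileReader("largeCSV.txt")));
        String[] nextLine;
        while (true)
        {
            long t0 = System.nanoTime();
            nextLine = reader.readNext();
            readNanos += System.nanoTime() - t0;
            if (nextLine == null)
            {
                break;
            }

            long t1 = System.nanoTime();
            Matcher m = pattern.matcher(nextLine[6]);
            while (m.find())
            {
                // consume matches
            }
            matchNanos += System.nanoTime() - t1;
        }
        reader.close();

        System.out.println("Reading:  " + (readNanos / 1000000) + " ms");
        System.out.println("Matching: " + (matchNanos / 1000000) + " ms");
    }
}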
EDIT2: OK, I got bored and profiled this. My code is identical to yours and parsed a CSV file with 1,000 identical lines as follows:
SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP,SOME REGEXP
Here are the results (obviously your results will differ as my regular expressions are trivial). However, it's plain to see that the regex processing is not your main area of concern.
Interestingly, if I wrap the FileReader in a BufferedReader, the performance improves by a whopping 18%.