
Reading a CSV file with millions of rows via Java as fast as possible

Tags: java, csv

I want to read a CSV file containing millions of rows and use its attributes in my decision tree algorithm. My code is below:

String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String cvsSplitBy = ",";
String encoding = "UTF-8";
BufferedReader br2 = null;
try {
    int counterRow = 0;
    br2 =  new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding));
    while ((line = br2.readLine()) != null) { 
        line=line.replaceAll(",,", ",NA,");
        String[] object = line.split(cvsSplitBy);
        rowList.add(object); 
        counterRow++;
    }
    System.out.println("counterRow is: "+counterRow);
    for(int i=1;i<rowList.size();i++){
        try{
           //this method includes many if elses only.
           ImplementDecisionTreeRulesFor2012(rowList.get(i)[0],rowList.get(i)[1],rowList.get(i)[2],rowList.get(i)[3],rowList.get(i)[4],rowList.get(i)[5],rowList.get(i)[6]); 
        }
        catch(Exception ex){
System.out.println("Exception occurred");   
        }
    }
}
catch(Exception ex){
    System.out.println("fix"+ex);
}
finally {
    if (br2 != null) {
        try { br2.close(); } catch (Exception e) { /* ignore */ }
    }
}

It works fine when the CSV file is not large, but mine is large indeed. I therefore need a faster way to read the CSV. Any advice? Thanks in advance.

asked Mar 31 '16 by Joe Leffrey



3 Answers

Just use uniVocity-parsers' CSV parser instead of trying to build your custom parser. Your implementation will probably not be fast or flexible enough to handle all corner cases.

It is extremely memory efficient, and you can parse a million rows in less than a second. This link has a performance comparison of many Java CSV libraries, and univocity-parsers comes out on top.

Here's a simple example of how to use it:

CsvParserSettings settings = new CsvParserSettings(); // you'll find many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);

// parses all rows in one go (you should probably use a RowProcessor or iterate row by row if there are many rows)
List<String[]> allRows = parser.parseAll(new File("/path/to/your.csv"));

BUT, that loads everything into memory. To stream all rows, you can do this:

String[] row;
parser.beginParsing(new File("/path/to/your.csv"));
while ((row = parser.parseNext()) != null) {
    // process the row here
}

The faster approach is to use a RowProcessor; it also gives you more flexibility:

settings.setRowProcessor(myChosenRowProcessor);
CsvParser parser = new CsvParser(settings);
parser.parse(new File("/path/to/your.csv"));
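For illustration, here is a sketch of a concrete RowProcessor wired to the decision-tree method from the question (the anonymous class and file path are assumptions, not part of the answer; depending on the library version the settings method may be setProcessor instead of setRowProcessor):

// needs com.univocity.parsers.common.ParsingContext and
// com.univocity.parsers.common.processor.AbstractRowProcessor
CsvParserSettings settings = new CsvParserSettings();
settings.setRowProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // called once per parsed row; nothing is accumulated in memory
        ImplementDecisionTreeRulesFor2012(row[0], row[1], row[2], row[3], row[4], row[5], row[6]);
    }
});
CsvParser parser = new CsvParser(settings);
parser.parse(new File("/path/to/your.csv"));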

Lastly, it has built-in routines that use the parser to perform common tasks (iterating over Java beans, dumping ResultSets, etc.).
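As one illustration of the bean support (not taken from the answer; the bean class and column indexes below are assumptions), rows can also be mapped to annotated beans with a BeanListProcessor:

// needs com.univocity.parsers.annotations.Parsed and
// com.univocity.parsers.common.processor.BeanListProcessor
public class Record2012 {                      // hypothetical bean for the question's 7 columns
    @Parsed(index = 0) private String col0;
    @Parsed(index = 1) private String col1;
    // ... one annotated field per column ...
}

BeanListProcessor<Record2012> rowProcessor = new BeanListProcessor<>(Record2012.class);
settings.setRowProcessor(rowProcessor);        // or setProcessor(...), depending on the version
new CsvParser(settings).parse(new File("/path/to/your.csv"));
List<Record2012> records = rowProcessor.getBeans();   // note: this keeps every bean in memory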

This should cover the basics; check the documentation to find the best approach for your case.

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

answered Sep 23 '22 by Jeronimo Backes


In this snippet I see two issues which will slow you down considerably:

while ((line = br2.readLine()) != null) { 
    line=line.replaceAll(",,", ",NA,");
    String[] object = line.split(cvsSplitBy);
    rowList.add(object); 
    counterRow++;
}

First, rowList starts with the default capacity and will have to be grown many times, each time copying the old underlying array into a new one.
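If the list really has to be kept, presizing it avoids those repeated copies; a one-line sketch (the capacity is an assumption about the expected row count):

List<String[]> rowList = new ArrayList<>(10000000); // roughly the expected number of rows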

Worse, however, is the eager blow-up of every line into a String[] object. You need the columns/cells only when you call ImplementDecisionTreeRulesFor2012 for that row, not all the time while you read the file and process all the other rows. Move the split (or something better, as suggested in the comments) into the second loop.

(Creating many objects is bad, even if you can afford the memory.)

Perhaps it would be better to call ImplementDecisionTreeRulesFor2012 while you read the "millions"? It would avoid the rowList ArrayList altogether.
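A minimal sketch of that idea, keeping the question's empty-cell handling and skipping the first line (the original loop started at index 1, which suggests a header row):

// assumes java.io.* and java.nio.charset.StandardCharsets are imported,
// and ImplementDecisionTreeRulesFor2012 is the method from the question
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("myfile.csv"), StandardCharsets.UTF_8))) {
    String line;
    boolean first = true;
    while ((line = br.readLine()) != null) {
        if (first) { first = false; continue; }   // skip the header row
        String[] cells = line.replaceAll(",,", ",NA,").split(",");
        ImplementDecisionTreeRulesFor2012(cells[0], cells[1], cells[2],
                cells[3], cells[4], cells[5], cells[6]);
    }
} catch (IOException e) {
    e.printStackTrace();
}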

Later: Postponing the split reduces the execution time for 10 million rows from 1m8.262s (at which point the program ran out of heap space) to 13.067s.

If you aren't forced to read all rows before you can call ImplementDecisionTreeRulesFor2012, the time drops to 4.902s.

Finally, writing the split and replace by hand:

String[] object = new String[7];
// ... inside the read loop ...
    String x = line + ",";          // trailing comma so the last cell is terminated too
    int iPos = 0;                   // start of the current cell
    int iStr = 0;                   // index of the next cell to fill
    int iNext = -1;                 // position of the next comma
    while( (iNext = x.indexOf( ',', iPos )) != -1 && iStr < 7 ){
        if( iNext == iPos ){
            object[iStr++] = "NA";  // empty cell
        } else {
            object[iStr++] = x.substring( iPos, iNext );
        }
        iPos = iNext + 1;
    }
    // add more "NA" if rows can have fewer than 7 cells

reduces the time to 1.983s. This is about 30 times faster than the original code, which runs into an OutOfMemoryError anyway.

answered Sep 20 '22 by laune


On top of the aforementioned univocity, it's worth checking:

  • https://github.com/FasterXML/jackson-dataformat-csv
  • http://simpleflatmapper.org/0101-getting-started-csv.html, which also has a low-level API that bypasses String creation.

The three of them were, at the time of this answer, among the fastest CSV parsers.

Chances are that writing your own parser would be slower and buggier.
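For illustration, a minimal sketch of streaming rows with jackson-dataformat-csv (the file name is an assumption; WRAP_AS_ARRAY makes each line come back as a String[]):

// needs com.fasterxml.jackson.dataformat.csv.CsvMapper / CsvParser and
// com.fasterxml.jackson.databind.MappingIterator on the classpath
CsvMapper mapper = new CsvMapper();
mapper.enable(com.fasterxml.jackson.dataformat.csv.CsvParser.Feature.WRAP_AS_ARRAY);
MappingIterator<String[]> it = mapper.readerFor(String[].class)
        .readValues(new File("myfile.csv"));
while (it.hasNext()) {
    String[] row = it.next();
    // process the row here
}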

answered Sep 24 '22 by user3996996