
Fastest way of reading a CSV file in Java

I have noticed that using java.util.Scanner is very slow when reading large files (in my case, CSV files).

I want to change the way I am currently reading files, to improve performance. Below is what I have at the moment. Note that I am developing for Android:

InputStreamReader inputStreamReader;
    try {
        inputStreamReader = new InputStreamReader(context.getAssets().open("MyFile.csv"));
        Scanner inputStream = new Scanner(inputStreamReader);
        inputStream.nextLine(); // Ignores the first line
        while (inputStream.hasNext()) {
            String data = inputStream.nextLine(); // Gets a whole line
            String[] line = data.split(","); // Splits the line up into a string array

            if (line.length > 1) {
                // Do stuff, e.g:
                String value = line[1];
            }
        }
        inputStream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

Using Traceview, I found that the main performance issues are specifically in java.util.Scanner.nextLine() and java.util.Scanner.hasNext().

I've looked at other questions (such as this one), and I've come across some CSV readers, like Apache Commons CSV, but there doesn't seem to be much information on how to use them, and I'm not sure how much faster they would be.

I have also heard about using FileReader and BufferedReader in answers like this one, but again, I do not know whether the improvements will be significant.
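For reference, the BufferedReader approach from those answers looks roughly like this. This is a minimal sketch: the `readCsv` helper name and the use of `String.split` are illustrative, and it reads from any `Reader`, so it would also work with an `InputStreamReader` over an Android asset:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class CsvRead {

    // Reads comma-separated rows line by line, skipping the header.
    // BufferedReader.readLine() avoids Scanner's regex-based scanning,
    // which is where the time goes in the Scanner version.
    static List<String[]> readCsv(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // ignore the first line, as in the original code
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // On Android this would be something like:
        // readCsv(new InputStreamReader(context.getAssets().open("MyFile.csv")))
        List<String[]> rows = readCsv(new StringReader("h1,h2\na,b\nc,d\n"));
        System.out.println(rows.size());    // 2 data rows
        System.out.println(rows.get(0)[1]); // b
    }
}
```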

My file is about 30,000 lines long. With the code above, it takes at least a minute just to read values from around line 600, so I have not timed how long it would take to read values beyond line 2,000. Sometimes, while reading, the Android app becomes unresponsive and crashes.

Although I could simply change parts of my code and see for myself, I would like to know whether there are any faster alternatives I have not mentioned, or whether I should just use FileReader and BufferedReader. Would it be faster to split the huge file into smaller files and choose which one to read depending on what information I want to retrieve? Preferably, I would also like to know why the fastest method is the fastest (i.e. what makes it fast).
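On the "what makes it fast" question, the usual explanation is that Scanner runs regex machinery on every hasNext()/nextLine() call, while BufferedReader.readLine() just scans a buffer for the next newline. A rough way to see the difference on synthetic data (a sketch; the timings are illustrative and will vary by device):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

public class CompareReaders {

    // Scanner.hasNextLine()/nextLine() go through regex machinery on
    // every call, which dominates the cost on large inputs.
    static int countLinesScanner(String data) {
        int count = 0;
        try (Scanner scanner = new Scanner(new StringReader(data))) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                count++;
            }
        }
        return count;
    }

    // BufferedReader.readLine() just scans its buffer for the next
    // newline: no regex, far fewer method calls per character.
    static int countLinesBuffered(String data) {
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            int count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 30_000; i++) { // roughly the file size from the question
            sb.append("value").append(i).append(",x,y\n");
        }
        String data = sb.toString();

        long t0 = System.nanoTime();
        int a = countLinesScanner(data);
        long t1 = System.nanoTime();
        int b = countLinesBuffered(data);
        long t2 = System.nanoTime();

        System.out.println("Scanner:        " + (t1 - t0) / 1_000_000 + " ms, " + a + " lines");
        System.out.println("BufferedReader: " + (t2 - t1) / 1_000_000 + " ms, " + b + " lines");
    }
}
```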

Farbod Salamat-Zadeh, asked Jun 26 '15


2 Answers

uniVocity-parsers has the fastest CSV parser you'll find (2x faster than OpenCSV, 3x faster than Apache Commons CSV), with many unique features.

Here's a simple example of how to use it:

CsvParserSettings settings = new CsvParserSettings(); // many options here, have a look at the tutorial

CsvParser parser = new CsvParser(settings);

// parses all rows in one go
List<String[]> allRows = parser.parseAll(new FileReader(new File("your/file.csv")));

To make the process faster, you can select the columns you are interested in:

settings.selectFields("Column X", "Column A", "Column Y");

Normally, you should be able to parse 4 million rows in around 2 seconds. With column selection, the speed will improve by roughly 30%.

It is even faster if you use a RowProcessor. There are many implementations available out of the box for conversions to objects, POJOs, etc. The documentation explains all of the available features. It works like this:

// let's get the values of all columns using a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
settings.setRowProcessor(rowProcessor);

//the parse() method will submit all rows to the row processor
parser.parse(new FileReader(new File("/examples/example.csv")));

//get the result from your row processor:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();

We also built a simple speed comparison project here.

Jeronimo Backes, answered Sep 21 '22


Your code is fine for loading big files. However, when an operation may take longer than expected, it's good practice to execute it in a background task rather than on the UI thread, to keep the app responsive.

The AsyncTask class helps with that:

private class LoadFilesTask extends AsyncTask<String, Integer, Long> {
    protected Long doInBackground(String... str) {
        long lineNumber = 0;
        InputStreamReader inputStreamReader;
        try {
            inputStreamReader = new
                    InputStreamReader(context.getAssets().open(str[0]));
            Scanner inputStream = new Scanner(inputStreamReader);
            inputStream.nextLine(); // Ignores the first line

            while (inputStream.hasNext()) {
                lineNumber++;
                String data = inputStream.nextLine(); // Gets a whole line
                String[] line = data.split(","); // Splits the line up into a string array

                if (line.length > 1) {
                    // Do stuff, e.g:
                    String value = line[1];
                }
            }
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lineNumber;
    }

    //If you need to show progress, call publishProgress(...) from
    //doInBackground; it triggers this method on the UI thread
    protected void onProgressUpdate(Integer... progress) {
        setYourCustomProgressPercent(progress[0]);
    }

    //This method is triggered at the end of the process, in your case when the loading has finished
    protected void onPostExecute(Long result) {
        showDialog("File Loaded: " + result + " lines");
    }
}

...and execute it like this:

new LoadFilesTask().execute("MyFile.csv");

Ciro Rizzo, answered Sep 25 '22