
Good and effective CSV/TSV Reader for Java

I am trying to read large CSV and TSV (tab-separated) files with about 1,000,000 rows or more. I tried to read a TSV containing ~2,500,000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250,000 lines. Are there other libraries that support reading huge CSV and TSV files? Do you have any ideas?

For anyone interested, here is my code (I shortened it, so the try-catch is obviously not valid as shown):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    } 
}

Edit: This is the method where I construct the InputStreamReader:

private InputStreamReader replaceBackSlashes() throws Exception {
    FileInputStream fis = null;
    Scanner in = null;
    try {
        fis = new FileInputStream(this.csvFile);
        in = new Scanner(fis, this.encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        while (in.hasNext()) {
            String nextLine = in.nextLine().replace("\\", "/");
            // nextLine = nextLine.replaceAll(" ", "");
            nextLine = nextLine.replaceAll("'", "");
            out.write(nextLine.getBytes());
            out.write("\n".getBytes());
        }

        return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
    } catch (Exception e) {
        in.close();
        fis.close();
        this.logger.error("Problem at replaceBackSlashes", e);
    }
    throw new Exception();
}
Robin asked Dec 14 '12 at 13:12



2 Answers

Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
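To make that failure mode concrete, here is a small JDK-only sketch. It is not how opencsv or any other library is implemented; `TsvQuoteDemo`, `splitPlainTsv` and `splitQuoteAware` are hypothetical names, and the quote-aware splitter is a deliberately simplified stand-in for CSV-style quoting. It shows how a lone quote inside a TSV field makes a quote-aware parser swallow the following tab:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: the class and method names are made up for this sketch.
final class TsvQuoteDemo {

    // Plain TSV splitting: every tab is a delimiter, quotes are ordinary characters.
    static List<String> splitPlainTsv(String line) {
        // The -1 limit keeps trailing empty fields.
        return Arrays.asList(line.split("\t", -1));
    }

    // Simplified CSV-style splitting: a double quote toggles "inside quotes",
    // and delimiters seen while inside quotes are treated as field content.
    static List<String> splitQuoteAware(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // the quote itself is consumed, not kept
            } else if (c == delimiter && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Three tab-separated fields; the second contains a lone quote (an inch mark).
        String line = "disk\t3.5\" floppy\t1.44MB";
        System.out.println("plain TSV split: " + splitPlainTsv(line).size() + " fields");
        System.out.println("quote-aware split: " + splitQuoteAware(line, '\t').size() + " fields");
    }
}
```

With the sample line, the plain split yields three fields, while the quote-aware split merges the last two because the unbalanced quote is never closed. In a real file the "open quote" state can carry on across many lines.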

uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

Example to parse a TSV input:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

If your input is so big it can't be kept in memory, do this:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model": trims them and converts them to lower case.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));
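The row-at-a-time idea does not depend on any particular library. As a rough illustration using only the JDK (this is not uniVocity's implementation, and `TsvStreamDemo` and `forEachRow` are hypothetical names), the sketch below hands each row to a callback as soon as it is read, so memory use stays bounded by the longest line. It assumes a plain TSV with no escape sequences and no line breaks inside fields:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Illustrative only: the class and method names are made up for this sketch.
final class TsvStreamDemo {

    // Reads the input line by line, passes each row to the handler,
    // and returns the number of rows seen. Nothing is accumulated in memory.
    static long forEachRow(Reader input, Consumer<String[]> rowHandler) {
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The -1 limit keeps trailing empty fields.
                rowHandler.accept(line.split("\t", -1));
                rows++;
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers can use plain lambdas.
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a FileReader on a large file.
        Reader input = new StringReader("id\tname\n1\tfoo\n2\tbar\n");
        long count = forEachRow(input, row -> System.out.println(String.join(" | ", row)));
        System.out.println(count + " rows");
    }
}
```

For real data a dedicated TSV parser is still the safer choice, since it handles escape sequences and malformed input that this sketch ignores.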

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Jeronimo Backes answered Oct 04 '22 09:10


I have not tried it myself, but I looked into Super CSV a while ago.

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

Check whether it can handle your 2.5 million lines.

RuntimeException answered Oct 04 '22 10:10