
Error Parsing due to CSV Differences Before/After Saving (Java w/ Apache Commons CSV)

I have a 37 column CSV file that I am parsing in Java with Apache Commons CSV 1.2. My setup code is as follows:

//initialize FileReader object
FileReader fileReader = new FileReader(file);

//initialize CSVFormat object
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withHeader(FILE_HEADER_MAPPING);

//initialize CSVParser object
CSVParser csvFileParser = new CSVParser(fileReader, csvFileFormat);

//Get a list of CSV file records
List<CSVRecord> csvRecords = csvFileParser.getRecords();

// process accordingly

My problem is that when I copy the CSV to be processed to my target directory and run my parsing program, I get the following error:

Exception in thread "main" java.lang.IllegalArgumentException: Index for header 'Title' is 7 but CSVRecord only has 6 values!
        at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:110)
        at launcher.QualysImport.createQualysRecords(Unknown Source)
        at launcher.QualysImport.importQualysRecords(Unknown Source)
        at launcher.Main.main(Unknown Source)

However, if I copy the file to my target directory, open and save it, then run the program again, it works. Opening and saving the CSV adds back the trailing commas, so my program no longer complains about a record having too few values.

For context, here is a sample line of before/after saving:

Before (failing): "data","data","data","data"

After (working): "data","data",,,,"data",,,"data",,,,,,

So my question: why does the CSV format change when I open and save it? I'm not changing any values or encoding, and the behavior is the same for MS-DOS or regular .csv format when saving. Also, I'm using Excel to copy/open/save in my testing.

Is there some encoding or format setting I need to be using? Can I solve this programmatically?

Thanks in advance!

EDIT #1:

For additional context, when I first view an empty line in the original file, it just has the new line ^M character like this:

^M

After opening in Excel and saving, it looks like this with all 37 of my empty fields:

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,^M

Is this a Windows encoding discrepancy?

corneria asked Apr 15 '16 17:04



1 Answer

That may be a compatibility issue with whatever generated the file in the first place. Excel appears to accept a blank line as a valid row with an empty string in each column, taking the column count from the other rows. It then saves that row according to CSV conventions, with the column delimiters written out. (The ^M is the carriage return character; on Microsoft systems it precedes the line feed character at the end of each line in a text file.)

Perhaps you can deal with it by creating your own Reader subclass to sit between the FileReader and the CSVParser. Your reader will read a line, and if it is blank then return a line with the correct number of commas. Otherwise just return the line as-is.

For example:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class MyCSVCompatibilityReader extends BufferedReader
    {
    public MyCSVCompatibilityReader(final FileReader fileReader)
        {
        // BufferedReader has no no-arg constructor, so wrap the file here
        super(fileReader);
        }

    @Override
    public String readLine() throws IOException
        {
        final String line = super.readLine();
        // null means end of stream and must be passed through unchanged
        if (line != null && line.trim().isEmpty())
            { return ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"; }
        else
            { return line; }
        }
    }

A complete Reader has more to get right than readLine(). The other methods (close, ready, reset, skip, etc.) must behave consistently, and each of the read() methods must return the same patched content. If the file fits easily in memory, it might be simpler to read the whole file, write the fixed version to a StringWriter, and then hand the CSVParser a StringReader over the result.
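The in-memory alternative could look roughly like the sketch below. The class name BlankLinePadder and the method padBlankLines are made up for illustration, and the column count is passed in rather than hard-coded. It assumes the file fits in memory and that no quoted field spans several lines (a blank line inside such a field would be padded by mistake).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical helper, not part of Apache Commons CSV.
public class BlankLinePadder {

    // Copies 'source' into memory, replacing each blank line with a row of
    // (columnCount - 1) commas, and returns a Reader over the fixed text.
    // Assumption: no quoted field contains a line break.
    public static Reader padBlankLines(Reader source, int columnCount) throws IOException {
        // N columns need N - 1 delimiters.
        StringBuilder emptyRow = new StringBuilder();
        for (int i = 1; i < columnCount; i++) {
            emptyRow.append(',');
        }

        StringWriter fixed = new StringWriter();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            // readLine() strips the line terminator and returns null at end of stream.
            while ((line = in.readLine()) != null) {
                fixed.write(line.trim().isEmpty() ? emptyRow.toString() : line);
                fixed.write("\r\n");
            }
        }
        return new StringReader(fixed.toString());
    }
}
```

With the question's setup, the returned reader would take the place of the FileReader: new CSVParser(BlankLinePadder.padBlankLines(new FileReader(file), 37), csvFileFormat). This pads only blank lines, as described above. Rows that have data but lack trailing commas, like the "Before" sample in the question, are left untouched and would need separate handling.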

dsh answered Sep 30 '22 17:09