Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using CsvBeanReader to read a CSV file with a variable number of columns

Tags:

java

csv

supercsv

So I'm working on parsing a .csv file. I took the advice of another thread somewhere on StackOverflow and downloaded SuperCSV. I finally got pretty much everything working, but now I've run into a bug that seems difficult to fix.

The problem occurs because the last two columns of data may or may not be populated. Here is an example of a .csv file with the first row missing the last column, and the second row entirely complete:

2012:07:25,11:48:20,922,"uLog.exe","",Key pressed,1246,341,-1.00,-1.00,1.00,Shift 2012:07:25,11:48:21,094,"uLog.exe","",Key pressed,1246,341,-1.00,-1.00,1.00,b,Shift

From my understanding of the Super CSV Javadoc, there is no way to populate a Java Bean with the CsvBeanReader if there are a variable number of columns. This seems really dumb because I feel like these missing columns should be allowed to be null or some other default value when the Bean is initialized.

For reference, here is my complete code for the parser:

public class ULogParser {

String uLogFileLocation;
String screenRecorderFileLocation;

private static final CellProcessor[] cellProcessor = new CellProcessor[] {
    new ParseDate("yyyy:MM:dd"),
    new ParseDate("HH:mm:ss"),
    new ParseDate("SSS"),
    new StrMinMax(0, 100),
    new StrMinMax(0, 100),
    new StrMinMax(0, 100),
    new ParseInt(),
    new ParseInt(),
    new ParseDouble(),
    new ParseDouble(),
    new ParseDouble(),
    new StrMinMax(0, 100),
    new StrMinMax(0, 100),
};

public String[] header = {"Date", "Time", "Msec", "Application", "Window", "Message", "X", "Y", "RelDist", "TotalDist", "Rate", "Extra1", "Extra2"}; 

public ULogParser(String uLogFileLocation, String screenRecorderFileLocation)
{
    this.uLogFileLocation = uLogFileLocation;
    this.screenRecorderFileLocation = screenRecorderFileLocation;
}

public void parse()
{
    try {
        ICsvBeanReader reader = new CsvBeanReader(new BufferedReader(new FileReader(uLogFileLocation)), CsvPreference.STANDARD_PREFERENCE);
        reader.getCSVHeader(false); //parse past the header
        Entry entry;
        entry = reader.read(Entry.class, header, cellProcessor);
        System.out.println(entry.Application);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

public void sendToDB()
{
    Query query = new Query();
}
}

And the code for the Entry class:

public class Entry
{
private Date Date;
private Date Time;
private Date Msec;
private String Application;
private String Window;
private String Message;
private int X;
private int Y;
private double RelDist;
private double TotalDist;
private double Rate;
private String Extra1;
private String Extra2;

public Date getDate() { return Date; }
public Date getTime() { return Time; }
public Date getMsec() { return Msec; }
public String getApplication() { return Application; }
public String getWindow() { return Window; }
public String getMessage() { return Message; }
public int getX() { return X; }
public int getY() { return Y; }
public double getRelDist() { return RelDist; }
public double getTotalDist() { return TotalDist; }
public double getRate() { return Rate; }
public String getExtra1() { return Extra1; }
public String getExtra2() { return Extra2; }

public void setDate(Date Date) { this.Date = Date; }
public void setTime(Date Time) { this.Time = Time; }
public void setMsec(Date Msec) { this.Msec = Msec; }
public void setApplication(String Application) { this.Application = Application; }
public void setWindow(String Window) { this.Window = Window; }
public void setMessage(String Message) { this.Message = Message; }
public void setX(int X) { this.X = X; }
public void setY(int Y) { this.Y = Y; }
public void setRelDist(double RelDist) { this.RelDist = RelDist; }
public void setTotalDist(double TotalDist) { this.TotalDist = TotalDist; }
public void setRate(double Rate) { this.Rate = Rate; }
public void setExtra1(String Extra1) { this.Extra1 = Extra1; }
public void setExtra2(String Extra2) { this.Extra2 = Extra2; }

public Entry(){}
}

And the exception I'm receiving (note this is a different line than my above example, missing both of the last two columns):

Exception in thread "main" The value array (size 12)  must match the processors array (size 13): You are probably reading a CSV line with a different number of columns than the number of cellprocessors specified context: Line: 2 Column: 0 Raw line:
[2012:07:25, 11:48:05, 740, uLog.exe,  , Logging started, -1, -1, -1.00, -1.00, -1.00, ]
 offending processor: null
    at org.supercsv.util.Util.processStringList(Unknown Source)
    at org.supercsv.io.CsvBeanReader.read(Unknown Source)
    at processing.ULogParser.parse(ULogParser.java:59)
    at ui.ParseImplicitData.main(ParseImplicitData.java:15)

Yes, writing all those getters and setters was a pain in the ass. Also, I apologize, I probably don't have perfect convention in my use of SuperCSV (like what CellProcessor to use if you just want the unmodified String), but you get the idea. Also, this code is obviously not complete. For now, I'm just trying to successfully retrieve a line of data.

At this point, I'm wondering if using the CsvBeanReader is possible for my purposes. If not, I'm a little disappointed, since the CsvListReader (I would post hyperlink, but StackOverflow isn't allowing me too, also dumb) is just about as easy as not using the API at all, and just using Scanner.next().

Any help would be appreciated. Thanks in advance!

like image 652
Bryce Sandlund Avatar asked Jul 26 '12 21:07

Bryce Sandlund


3 Answers

Edit: Update for Super CSV 2.0.0-beta-1

Please note the API has changed in Super CSV 2.0.0-beta-1 (the code example is based on 1.52). The getCSVHeader() method on all readers is now getHeader() (to be in line with writeHeader on the writers).

Also, SuperCSVException has been renamed to SuperCsvException.


Edit: Update for Super CSV 2.1.0

Since version 2.1.0 it's possible to execute the cell processors after reading a line of CSV by using the new executeProcessors() method. For more information see this example on the project website. Please note this is only relevant for CsvListReader, as it's the only reader that allows for variable column length.


You're correct - CsvBeanReader doesn't support CSV files with a variable number of columns. According to most CSV specifications (including RFC 4180), the number of columns must be the same on every row.

For this reason (as a Super CSV developer) I'm reluctant to add this functionality to Super CSV. If you can think of an elegant way to add it then feel free to make suggestions on the project's SourceForge site. It would probably mean a new reader that extends upon CsvBeanReader: it would have to split the reading and mapping/processing into two separate methods (you can't do any processing or mapping to fields of the bean unless you know how many columns there are).

Simple solution

The simple solution to this (if you have control of the CSV file you're working with) is to simply add a blank column when writing your CSV file (the first line in your example would have a comma at the end - to indicate the last column is empty). That way, your CSV file will be valid (it will have the same number of columns on every row) and you can use CsvBeanReader as you're already doing.

If that's not possible, then all is not lost!

Fancy solution

As you probably realize, CsvBeanReader uses the name mapping to associate each column in the CSV file with a field in your bean, and the CellProcessor array to process each column. In other words, you have to know how many columns there are (and what they represent) if you want to use it.

CsvListReader, on the other hand, is very primitive and can read rows of varying length (because it doesn't need to process or map them).

So you can combine all the features of CsvBeanReader with CsvListReader (as done in the following example) by reading the file with both readers in parallel: using CsvListReader to figure out how many columns there are, and CsvBeanReader to do the processing/mapping.

Note that this makes the assumption that it's only ever the birthDate column that may not be present (i.e. it wouldn't work if you can't tell which column is missing).

package example;

import java.io.StringReader;
import java.util.Date;

import org.supercsv.cellprocessor.ParseDate;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.exception.SuperCSVException;
import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

public class VariableColumns {

    private static final String INPUT = "name,birthDate,city\n"
        + "John,New York\n" 
        + "Sally,22/03/1974,London\n" 
        + "Jim,Sydney";

    // cell processors
    private static final CellProcessor[] NORMAL_PROCESSORS = 
    new CellProcessor[] {null, new ParseDate("dd/MM/yyyy"), null };
    private static final CellProcessor[] NO_BIRTHDATE_PROCESSORS = 
    new CellProcessor[] {null, null };

    // name mappings
    private static final String[] NORMAL_HEADER = 
    new String[] { "name", "birthDate", "city" };
    private static final String[] NO_BIRTHDATE_HEADER = 
    new String[] { "name", "city" };

    public static void main(String[] args) {

        // using bean reader and list reader together (to read the same file)
        final ICsvBeanReader beanReader = new CsvBeanReader(new StringReader(
                INPUT), CsvPreference.STANDARD_PREFERENCE);
        final ICsvListReader listReader = new CsvListReader(new StringReader(
                INPUT), CsvPreference.STANDARD_PREFERENCE);

        try {
            // skip over header
            beanReader.getCSVHeader(true);
            listReader.getCSVHeader(true);

            while (listReader.read() != null) {

                final String[] nameMapping;
                final CellProcessor[] processors;

                if (listReader.length() == NORMAL_HEADER.length) {
                    // all columns present - use normal header/processors
                    nameMapping = NORMAL_HEADER;
                    processors = NORMAL_PROCESSORS;

                } else if (listReader.length() == NO_BIRTHDATE_HEADER.length) {
                    // one less column - birth date must be missing
                    nameMapping = NO_BIRTHDATE_HEADER;
                    processors = NO_BIRTHDATE_PROCESSORS;

                } else {
                    throw new SuperCSVException(
                            "unexpected number of columns: "
                                    + listReader.length());
                }

                // can now use CsvBeanReader safely 
                // (we know how many columns there are)
                Person person = beanReader.read(Person.class, nameMapping,
                        processors);

                System.out.println(String.format(
                        "Person: name=%s, birthDate=%s, city=%s",
                        person.getName(), person.getBirthDate(),
                        person.getCity()));

            }
        } catch (Exception e) {
            // handle exceptions here
            e.printStackTrace();
        } finally {
            // close readers here
        }
    }

    public static class Person {

        private String name;
        private Date birthDate;
        private String city;

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public Date getBirthDate() {
            return birthDate;
        }

        public void setBirthDate(Date birthDate) {
            this.birthDate = birthDate;
        }

        public String getCity() {
            return city;
        }

        public void setCity(String city) {
            this.city = city;
        }
    }

}

I hope this helps.

Oh, and is there any reason why the fields in your Entry class don't follow normal naming conventions (camelCase)? If you update your header array to use camelcase, then your fields can be camelcase as well.

like image 166
James Bassett Avatar answered Nov 09 '22 05:11

James Bassett


Well, SuperCSV is Open Source. If you want to add functionality, such as handling input with a variable number of trailing fields, you have basically two options:

  1. Post a support request on the SourceForge site and hope the author agrees and has time to do it
  2. Download the source, change it to your liking, and contribute the changes to the project.

This is how Open Source works.

like image 32
Jim Garrison Avatar answered Nov 09 '22 04:11

Jim Garrison


Using uniVocity-parsers you can map CSV files with a varying number of columns to java beans. Using annotations:

class TestBean {

// if the value parsed in the quantity column is "?" or "-", it will be replaced by null.
@NullString(nulls = { "?", "-" })
// if a value resolves to null, it will be converted to the String "0".
@Parsed(defaultNullRead = "0")
private Integer quantity;   // The attribute type defines which conversion will be executed when processing the value.
// In this case, IntegerConversion will be used.
// The attribute name will be matched against the column header in the file automatically.

@Trim
@LowerCase
// the value for the comments attribute is in the column at index 4 (0 is the first column, so this means fifth column in the file)
@Parsed(index = 4)
private String comments;

// you can also explicitly give the name of a column in the file.
@Parsed(field = "amount")
private BigDecimal amount;

@Trim
@LowerCase
// values "no", "n" and "null" will be converted to false; values "yes" and "y" will be converted to true
@BooleanString(falseStrings = { "no", "n", "null" }, trueStrings = { "yes", "y" })
@Parsed
private Boolean pending;
...
}

To parse your CSV into a list of TestBean instances:

// BeanListProcessor converts each parsed row to an instance of a given class, then stores each instance into a list.
BeanListProcessor<TestBean> rowProcessor = new BeanListProcessor<TestBean>(TestBean.class);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setRowProcessor(rowProcessor);
//Uses the first valid row of the CSV to assign names to each column
parserSettings.setHeaderExtractionEnabled(true);

CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader(yourFile));

// The BeanListProcessor provides a list of objects extracted from the input.
List<TestBean> beans = rowProcessor.getBeans();

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

like image 1
Jeronimo Backes Avatar answered Nov 09 '22 05:11

Jeronimo Backes