 

CSV Autodetection in Java

What would be a reliable way of autodetecting that a file is actually CSV, if CSV were redefined to mean "Character-Separated Values", i.e., data using any single character (typically a non-alphanumeric symbol) as the delimiter, not only commas?

Essentially, with this (re)definition, CSV = DSV ("Delimiter-Separated Values"), discussed, for example, in this Wikipedia article, whereas the "Comma-Separated Values" format is defined in RFC 4180.

More specifically, is there a method for statistically deducing that the data has some "fixed" structure, suggesting "possible CSV"? Just counting the number of delimiters does not always work, because there are CSV files with a variable number of fields per record (i.e., records that, contrary to what RFC 4180 mandates, do not all have the same number of fields within the same file).

CSV recognition seems to be a particularly challenging problem, especially if detection cannot be based on the file extension (e.g., when reading a stream that does not have such information anyway).

Proper ("full") autodetection needs at least 4 decisions to be made reliably:

  1. Detecting that a file is actually CSV
  2. Detecting the presence of headers
  3. Detecting the actual separator character
  4. Detecting special characters (e.g., quotes)

Full autodetection seems to have no single solution, due to similarities with other data formats (e.g., free text that uses commas), especially for corner cases like variable-length records, single- or double-quoted fields, or multiline records.

So, the best approach seems to be telescopic detection, in which formats that can also be classified as CSV (e.g., log file formats like the Apache Common Log Format) are examined before the CSV detection rules are applied.

Even commercial applications like Excel seem to rely on the file extension (.csv) to decide on (1), which is obviously not autodetection, although the problem is greatly simplified if the application is told that the data is CSV.

Here are some good relevant articles discussing heuristics for (2) and (3):

  • Autodetection of headers (Java)
  • Autodetection of separator (C#)
  • Autodetection of headers and separator (Python)

The detection of (4), the type of quotes, can be based on processing a few lines of the file and looking for corresponding values (e.g., an even number of ' or " characters per row would suggest single or double quotes). Such processing can be done by initializing an existing CSV parser (e.g., OpenCSV) that will take proper care of CSV row separation (e.g., multiline records).
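A minimal sketch of that counting idea (class and method names are hypothetical; it reads raw lines, so multiline records would first need proper row separation by a parser such as OpenCSV):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class QuoteSniffer {

    // Returns '"', '\'' or 0 if neither looks like a consistent quote character.
    public static char guessQuoteChar(Reader input, int maxLines) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        int doubleQuoted = 0, singleQuoted = 0, sampled = 0;
        String line;
        while (sampled < maxLines && (line = reader.readLine()) != null) {
            int dq = count(line, '"');
            int sq = count(line, '\'');
            // An even, non-zero count suggests balanced quoting on this line.
            if (dq > 0 && dq % 2 == 0) doubleQuoted++;
            if (sq > 0 && sq % 2 == 0) singleQuoted++;
            sampled++;
        }
        if (doubleQuoted == 0 && singleQuoted == 0) return 0;
        return doubleQuoted >= singleQuoted ? '"' : '\'';
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) n++;
        }
        return n;
    }
}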

But what about (1), i.e., deciding that the data is CSV in the first place?

Could data mining help in this decision?

asked Dec 19 '11 by PNS

2 Answers

There are always going to be non-CSV files that look like CSV, and vice versa. For instance, there's the pathological (but perfectly valid) CSV file that frankc posted in the Java link you cited:

Name
Jim
Tom
Bill

The best one can do, I think, is some sort of heuristic estimate of the likelihood that a file is CSV. Some heuristics I can think of are:

  1. There is a candidate separator character that appears on every line (or, if you like, every line has more than one token).
  2. Given a candidate separator character, most (but not necessarily all) of the lines have the same number of fields.
  3. The presence of a first line that looks like it might be a header increases the likelihood of the file containing CSV data.

One can probably think up other heuristics. The approach would then be to develop a scoring algorithm based on these. The next step would be to score a collection of known CSV and non-CSV files. If there is a clear-enough separation, then the scoring could be deemed useful and the scores should tell you how to set a detection threshold.
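A rough sketch of such a scoring function, covering only heuristics 1 and 2 (the candidate list and the names are my own simplifications):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class CsvScorer {

    private static final char[] CANDIDATES = {',', ';', '\t', '|', ':'};

    // Returns a score in [0, 1]: the best field-count consistency over all candidates.
    public static double csvLikelihood(List<String> lines) {
        double best = 0.0;
        for (char sep : CANDIDATES) {
            best = Math.max(best, consistency(lines, sep));
        }
        return best;
    }

    // Fraction of lines whose field count equals the most common field count.
    // Returns 0 if the separator is missing from any line (heuristic 1).
    private static double consistency(List<String> lines, char sep) {
        if (lines.isEmpty()) return 0.0;
        Map<Integer, Integer> histogram = new HashMap<>();
        String regex = Pattern.quote(String.valueOf(sep));
        for (String line : lines) {
            int fields = line.split(regex, -1).length;
            if (fields < 2) return 0.0; // candidate must appear on every line
            histogram.merge(fields, 1, Integer::sum);
        }
        int mode = histogram.values().stream().max(Integer::compare).orElse(0);
        return (double) mode / lines.size();
    }
}

Running this over known CSV and non-CSV samples would then show whether some threshold (say, 0.9) separates the two groups cleanly.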

answered Oct 13 '22 by Ted Hopp

If you can't constrain what's used as a delimiter, then you can use brute force.

You could iterate through all possible combinations of quote character, column delimiter, and record delimiter (256 * 255 * 254 = 16581120 for single-byte characters). Take this sample data:

id,text,date
1,"Bob says, ""hi
..."", with a sigh",1/1/2012

Remove all quoted fields; this can be done with a regex replace.

//quick JavaScript example of the regex; you'd replace the quote char with whichever character you're currently testing
var test = 'id,text,date\n1,"Bob says, ""hi\n..."", with a sigh",1/1/2012';
console.log(test.replace(/"(""|.|\n|\r)*?"/gm, ""));

id,text,date
1,,1/1/2012

Split on record delimiter

["id,text,date", "1,,1/1/2012"]

Split records on column delimiter

[ ["id", "text", "date"], ["1", "", "1/1/2012"] ]

If the number of columns per record matches, you have some CSV confidence.

3 == 3

If the column counts don't match, try another combination of record, column, and quote characters.
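Putting those steps together in Java (since the question is Java-oriented), here is a sketch of testing one (quote, column, record) combination; the regex mirrors the JavaScript one above, and the names are made up:

import java.util.regex.Pattern;

public class DelimiterProbe {

    // True if, after stripping quoted fields, every record has the same
    // (greater than one) number of columns.
    public static boolean test(String data, char quote, char colSep, String recSep) {
        // Remove quoted fields; (?s) lets '.' match newlines, and the
        // doubled-quote alternative is tried first so escapes are skipped over.
        String q = Pattern.quote(String.valueOf(quote));
        String stripped = data.replaceAll("(?s)" + q + "(?:" + q + q + "|.)*?" + q, "");

        String[] records = stripped.split(Pattern.quote(recSep));
        if (records.length < 2) return false; // need at least two records to compare

        String colRegex = Pattern.quote(String.valueOf(colSep));
        int expected = records[0].split(colRegex, -1).length;
        if (expected < 2) return false; // the column delimiter must actually appear
        for (String record : records) {
            if (record.split(colRegex, -1).length != expected) return false;
        }
        return true;
    }
}

For the sample above, DelimiterProbe.test(data, '"', ',', "\n") would return true (3 columns in both records).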

EDIT

Actually parsing the data once you have confidence in the delimiters, and then checking for column-type uniformity, might be a useful extra step:

  • Are all the values in the first (header?) row strings?
  • Does column X always parse to null/empty or to a valid value (int, float, date)?

The more CSV data (rows, columns) there is to work with, the more confidence you can extract from this method.
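A sketch of the type-uniformity check (the predicate approach and names are my own; it assumes rectangular rows, and dates would need the same treatment with a parse test of their own):

import java.util.function.Predicate;

public class ColumnTypeCheck {

    // True if every non-empty value in the column (skipping a possible
    // header in row 0) satisfies the given parse test.
    public static boolean uniform(String[][] rows, int col, Predicate<String> parses) {
        for (int r = 1; r < rows.length; r++) {
            String value = rows[r][col];
            if (!value.isEmpty() && !parses.test(value)) return false;
        }
        return true;
    }

    public static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

For example, ColumnTypeCheck.uniform(parsed, 0, ColumnTypeCheck::isInt) would confirm that the id column of the sample data is consistently numeric.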

I think this question is kind of silly / overly general. If you have a stream of unknown data, you'd definitely want to check for all of the "low hanging fruit" first: binary formats usually have fairly distinct header signatures, and then there are XML and JSON as easily detectable text formats.
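For instance, a few bytes of sniffing can rule out the obvious cases before any CSV heuristics run. The magic numbers below are the well-known PNG, ZIP, and GZIP signatures; the list is illustrative, not exhaustive:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FormatSniffer {

    // Inspects the first bytes of a stream and returns a coarse format guess.
    public static String sniff(byte[] head) {
        if (startsWith(head, new byte[]{(byte) 0x89, 'P', 'N', 'G'})) return "png";
        if (startsWith(head, new byte[]{'P', 'K', 0x03, 0x04})) return "zip";
        if (startsWith(head, new byte[]{0x1F, (byte) 0x8B})) return "gzip";

        String text = new String(head, StandardCharsets.UTF_8).trim();
        if (text.startsWith("<?xml") || text.startsWith("<")) return "xml";
        if (text.startsWith("{") || text.startsWith("[")) return "json";
        return "unknown (candidate for CSV heuristics)";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}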

answered Oct 13 '22 by Louis Ricci