 

CSV Autodetection in Java

What would be a reliable way of autodetecting that a file is actually CSV, if CSV were redefined to mean "Character-Separated Values", i.e., data using any single character (typically a non-alphanumeric symbol) as the delimiter, not only commas?

Essentially, with this (re)definition, CSV = DSV ("Delimiter-Separated Values"), discussed, for example, in this Wikipedia article, whereas the "Comma-Separated Values" format is defined in RFC 4180.

More specifically, is there a method for statistically deducing that the data has some "fixed" structure, suggesting "possible CSV"? Just counting the number of delimiters does not always work, because there are CSV files with a variable number of fields per record (i.e., records that, contrary to what RFC 4180 mandates, do not all have the same number of fields within the same file).

CSV recognition seems to be a particularly challenging problem, especially if detection cannot be based on the file extension (e.g., when reading a stream that does not have such information anyway).

Proper ("full") autodetection needs at least 4 decisions to be made reliably:

  1. Detecting that a file is actually CSV
  2. Detecting the presence of headers
  3. Detecting the actual separator character
  4. Detecting special characters (e.g., quotes)

Full autodetection seems to have no single solution, due to similarities with other data formats (e.g., free text that uses commas), especially for corner cases like variable-length records, single- or double-quoted fields, or multiline records.

So, the best approach seems to be telescopic detection, in which formats that can also be classified as CSV (e.g., log file formats like the Apache Common Log Format) are examined before the CSV detection rules are applied.

Even commercial applications like Excel seem to rely on the file extension (.csv) to decide on (1), which is obviously not autodetection, although the problem is greatly simplified if the application is told that the data is CSV.

Here are some good relevant articles discussing heuristics for (2) and (3):

  • Autodetection of headers (Java)
  • Autodetection of separator (C#)
  • Autodetection of headers and separator (Python)

The detection of (4), the type of quotes, can be based on processing a few lines of the file and looking for corresponding values (e.g., an even number of ' or " characters per row would suggest single or double quotes). Such processing can be done by initializing an existing CSV parser (e.g., OpenCSV) that will take proper care of CSV row separation (e.g., multiline records).
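A minimal sketch of that counting idea (class and method names are hypothetical; it reads raw lines, so multiline records would first need proper row separation by a parser such as OpenCSV):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class QuoteSniffer {

    // Returns '"', '\'' or 0 if neither looks like a consistent quote character.
    public static char guessQuoteChar(Reader input, int maxLines) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        int doubleQuoted = 0, singleQuoted = 0, sampled = 0;
        String line;
        while (sampled < maxLines && (line = reader.readLine()) != null) {
            int dq = count(line, '"');
            int sq = count(line, '\'');
            // An even, non-zero count suggests balanced quoting on this line.
            if (dq > 0 && dq % 2 == 0) doubleQuoted++;
            if (sq > 0 && sq % 2 == 0) singleQuoted++;
            sampled++;
        }
        if (doubleQuoted == 0 && singleQuoted == 0) return 0;
        return doubleQuoted >= singleQuoted ? '"' : '\'';
    }

    private static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) n++;
        }
        return n;
    }
}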

But what about (1), i.e., deciding that the data is CSV in the first place?

Could data mining help in this decision?

asked Dec 19 '11 by PNS

2 Answers

There are always going to be non-CSV files that look like CSV, and vice versa. For instance, there's the pathological (but perfectly valid) CSV file that frankc posted in the Java link you cited:

Name
Jim
Tom
Bill

The best one can do, I think, is some sort of heuristic estimate of the likelihood that a file is CSV. Some heuristics I can think of are:

  1. There is a candidate separator character that appears on every line (or, if you like, every line has more than one token).
  2. Given a candidate separator character, most (but not necessarily all) of the lines have the same number of fields.
  3. The presence of a first line that looks like it might be a header increases the likelihood of the file containing CSV data.

One can probably think up other heuristics. The approach would then be to develop a scoring algorithm based on these. The next step would be to score a collection of known CSV and non-CSV files. If there is a clear-enough separation, then the scoring could be deemed useful and the scores should tell you how to set a detection threshold.
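A rough sketch of such a scoring function, covering only heuristics 1 and 2 (the candidate list and the names are my own simplifications):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class CsvScorer {

    private static final char[] CANDIDATES = {',', ';', '\t', '|', ':'};

    // Returns a score in [0, 1]: the best field-count consistency over all candidates.
    public static double csvLikelihood(List<String> lines) {
        double best = 0.0;
        for (char sep : CANDIDATES) {
            best = Math.max(best, consistency(lines, sep));
        }
        return best;
    }

    // Fraction of lines whose field count equals the most common field count.
    // Returns 0 if the separator is missing from any line (heuristic 1).
    private static double consistency(List<String> lines, char sep) {
        if (lines.isEmpty()) return 0.0;
        Map<Integer, Integer> histogram = new HashMap<>();
        String regex = Pattern.quote(String.valueOf(sep));
        for (String line : lines) {
            int fields = line.split(regex, -1).length;
            if (fields < 2) return 0.0; // candidate must appear on every line
            histogram.merge(fields, 1, Integer::sum);
        }
        int mode = histogram.values().stream().max(Integer::compare).orElse(0);
        return (double) mode / lines.size();
    }
}

Running this over known CSV and non-CSV samples would then show whether some threshold (say, 0.9) separates the two groups cleanly.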

answered Oct 13 '22 by Ted Hopp

If you can't constrain what's used as a delimiter, then you can use brute force.

You could iterate through all possible combinations of quote character, column delimiter, and record delimiter (256 * 255 * 254 = 16581120 for single-byte characters). Take this sample data:

id,text,date
1,"Bob says, ""hi
..."", with a sigh",1/1/2012

Remove all quoted fields; this can be done with a regex replace.

//quick JavaScript example of the regex; you'd replace the quote char with whichever character you're currently testing
var test = 'id,text,date\n1,"Bob says, ""hi\n..."", with a sigh",1/1/2012';
console.log(test.replace(/"(""|.|\n|\r)*?"/gm, ""));

id,text,date
1,,1/1/2012

Split on record delimiter

["id,text,date", "1,,1/1/2012"]

Split records on column delimiter

[ ["id", "text", "date"], ["1", "", "1/1/2012"] ]

If the number of columns per record matches, you have some CSV confidence.

3 == 3

If the column counts don't match, try another combination of record, column, and quote characters.
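Putting those steps together in Java (since the question is Java-oriented), here is a sketch of testing one (quote, column, record) combination; the regex mirrors the JavaScript one above, and the names are made up:

import java.util.regex.Pattern;

public class DelimiterProbe {

    // True if, after stripping quoted fields, every record has the same
    // (greater than one) number of columns.
    public static boolean test(String data, char quote, char colSep, String recSep) {
        // Remove quoted fields; (?s) lets '.' match newlines, and the
        // doubled-quote alternative is tried first so escapes are skipped over.
        String q = Pattern.quote(String.valueOf(quote));
        String stripped = data.replaceAll("(?s)" + q + "(?:" + q + q + "|.)*?" + q, "");

        String[] records = stripped.split(Pattern.quote(recSep));
        if (records.length < 2) return false; // need at least two records to compare

        String colRegex = Pattern.quote(String.valueOf(colSep));
        int expected = records[0].split(colRegex, -1).length;
        if (expected < 2) return false; // the column delimiter must actually appear
        for (String record : records) {
            if (record.split(colRegex, -1).length != expected) return false;
        }
        return true;
    }
}

For the sample above, DelimiterProbe.test(data, '"', ',', "\n") would return true (3 columns in both records).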

EDIT

Actually parsing the data once you have confidence in the delimiters, and then checking for column-type uniformity, might be a useful extra step:

  • Are all the values in the first (header?) row strings?
  • Does column X always parse to null/empty or to a valid value (int, float, date)?

The more CSV data (rows, columns) there is to work with, the more confidence you can extract from this method.
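A sketch of the type-uniformity check (the predicate approach and names are my own; it assumes rectangular rows, and dates would need the same treatment with a parse test of their own):

import java.util.function.Predicate;

public class ColumnTypeCheck {

    // True if every non-empty value in the column (skipping a possible
    // header in row 0) satisfies the given parse test.
    public static boolean uniform(String[][] rows, int col, Predicate<String> parses) {
        for (int r = 1; r < rows.length; r++) {
            String value = rows[r][col];
            if (!value.isEmpty() && !parses.test(value)) return false;
        }
        return true;
    }

    public static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

For example, ColumnTypeCheck.uniform(parsed, 0, ColumnTypeCheck::isInt) would confirm that the id column of the sample data is consistently numeric.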

I think this question is kind of silly / overly general. If you have a stream of unknown data, you'd definitely want to check for all of the "low hanging fruit" first: binary formats usually have fairly distinct header signatures, and then there are XML and JSON as easily detectable text formats.
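For instance, a few bytes of sniffing can rule out the obvious cases before any CSV heuristics run. The magic numbers below are the well-known PNG, ZIP, and GZIP signatures; the list is illustrative, not exhaustive:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FormatSniffer {

    // Inspects the first bytes of a stream and returns a coarse format guess.
    public static String sniff(byte[] head) {
        if (startsWith(head, new byte[]{(byte) 0x89, 'P', 'N', 'G'})) return "png";
        if (startsWith(head, new byte[]{'P', 'K', 0x03, 0x04})) return "zip";
        if (startsWith(head, new byte[]{0x1F, (byte) 0x8B})) return "gzip";

        String text = new String(head, StandardCharsets.UTF_8).trim();
        if (text.startsWith("<?xml") || text.startsWith("<")) return "xml";
        if (text.startsWith("{") || text.startsWith("[")) return "json";
        return "unknown (candidate for CSV heuristics)";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}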

answered Oct 13 '22 by Louis Ricci