
Scala: Auto detection of delimiter/separator in CSV file

I'm using the OpenCSV library to split my CSV files. Now I need to detect the delimiter/separator character with absolute certainty. I have searched the net, but I only found examples where you create a list of candidates and try each of them. I do not think that is the best way, because it is likely to produce errors. My splitter should work properly on any CSV file (I have no control over the input), so it has to be as generic as possible. Does anyone have a good solution?

YoBre asked May 20 '14 11:05



1 Answer

You may have already seen this related SO question, which lists good strategies, like counting the number of times a potential delimiter appears, and/or verifying that each row has the same number of columns when using a hypothetical delimiter.
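A minimal Scala sketch of that consistency check, under loud assumptions: the candidate list is a guess, `DelimiterGuess` and `guess` are hypothetical names (not part of OpenCSV), and delimiters inside quoted fields are not handled.

```scala
object DelimiterGuess {
  // Assumed candidate list; adjust it for the files you expect.
  val candidates: Seq[Char] = Seq(',', ';', '\t', '|')

  /** Returns a candidate that occurs the same non-zero number of times on
    * every non-empty sampled line. If several candidates qualify, the one
    * with the most occurrences per line wins; ties go to the earlier
    * candidate. Naive: quoting and escaping are ignored. */
  def guess(sample: Seq[String]): Option[Char] = {
    val lines = sample.filter(_.nonEmpty)
    val scored = candidates.flatMap { d =>
      val counts = lines.map(_.count(_ == d))
      counts.headOption
        .filter(n => n > 0 && counts.forall(_ == n))
        .map(n => (d, n))
    }
    scored.sortBy { case (_, n) => -n }.headOption.map(_._1)
  }
}
```

For example, `DelimiterGuess.guess(Seq("a;b;c", "1;2;3"))` yields `Some(';')`, while a sample whose rows disagree on the column count for every candidate yields `None`, which is the signal to fall back or ask the user.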

Unfortunately, absolute certainty is impossible because the format doesn't include a way to specify the delimiter unambiguously within the file. I think the best way to make it as generic as possible would be to make the user specify the delimiter when it isn't a comma (which is how opencsv handles it), or perhaps to allow a client to specify the delimiter if you or they determine that automatic detection failed. If this can't be interactive, then I think the best you can do is log the cases where you think detection failed so that they can be dealt with later.
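One way to express that fallback policy in Scala is sketched below. `DelimiterPolicy` and `resolve` are hypothetical names, the comma default is an assumption taken from the paragraph above, and writing to standard error stands in for whatever logging you actually use.

```scala
object DelimiterPolicy {
  /** Chooses the delimiter to parse with: an explicit caller-supplied value
    * wins, then the automatic guess, then the default. The Boolean is true
    * when neither was available, so the file can be flagged for review. */
  def resolve(
      specified: Option[Char],
      detected: Option[Char],
      default: Char = ','
  ): (Char, Boolean) = {
    val chosen = specified.orElse(detected).getOrElse(default)
    val needsReview = specified.isEmpty && detected.isEmpty
    if (needsReview)
      System.err.println(s"Delimiter detection failed; falling back to '$default'")
    (chosen, needsReview)
  }
}
```

Calling `DelimiterPolicy.resolve(None, None)` returns `(',', true)`, so a non-interactive batch job can keep going and still record which files need a second look.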

Also, I think the error rate will be lower than you're expecting. My guess is that 99% of the time the delimiter will be a comma, semicolon, period, or tab. I've unfortunately seen lazy coders use something like a caret, pipe, or tilde to delimit fields under the assumption that the data won't contain it, so they won't have to do proper escaping. But this isn't the norm, and it shouldn't be considered CSV.

The Python csv module has a Sniffer class that guesses the delimiter from a sample of the file (the caller can optionally supply a list of candidate delimiters); you may want to look at its implementation.

johncip answered Sep 28 '22 04:09
