Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to programmatically guess whether a CSV file is comma or semicolon delimited

In most cases, CSV files are text files with records delimited by commas. However, sometimes these files will come semicolon delimited. (Excel will use semicolon delimiters when saving CSVs if the regional settings has the decimal separator set as the comma -- this is common in Europe. Ref: http://en.wikipedia.org/wiki/Comma-separated_values#Application_support)

My question is, what is the best way to have a program guess whether to have it comma or semicolon separated?

e.g. a line like 1,1;1,1 may be ambiguous. It could be interpreted comma delimited as: 1 1;1 (a string) 1

or semicolon delimited as 1,1 1,1

My best guess so far is to try parsing the file both with , and ; delimiters, then choose the parse that has the most rows of the same length as the first row (usually a header row). If both have the same number of rows, choose the one with more columns. The main disadvantage of this is the extra overhead.

Thoughts?

like image 968
Polemarch Avatar asked May 07 '10 15:05

Polemarch


1 Answers

Depending on what you are working with, if you will guaranteeing have a header row, your approach of trying both, could be the best overall practice. Then once you determine what is going on, if you get to a row further down that doesn't have the required number of columns then you know that the format isn't correct.

Typically i would see this as a user specified option on upload, rather than a programmatic test.

like image 119
Mitchel Sellers Avatar answered Sep 20 '22 13:09

Mitchel Sellers