I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?
One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done.
Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won't know how many fields I have until I can parse it.
Here is an explanation update on what the text file looks like: If the text file has a tab as delimiter, then it delimited on every line. If the text file has a space as delimiter, then it is NOT delimited every line.
Delimited text files (. txt), in which the TAB character (ASCII character code 009) typically separates each field of text. Comma separated values text files (. csv), in which the comma character (,) typically separates each field of text.
Mac/Windows Select the CSV file that has the data clustered into one column. Select Delimited, then make sure the File Origin is Unicode UTF-8. Select Comma (this is Affinity's default list separator). The preview will show the columns being separated.
A delimiter is one or more characters that separate text strings. Common delimiters are commas (,), semicolon (;), quotes ( ", ' ), braces ({}), pipes (|), or slashes ( / \ ). When a program stores sequential or tabular data, it delimits each item of data with a predefined character.
In Python, there is a Sniffer class in the csv module that can be used to guess a given file's delimiter and quote characters. Its strategy is (quoted from csv.py's docstrings):
[First, look] for text enclosed between two identical quotes (the probable quotechar) which are preceded and followed by the same character (the probable delimiter). For example:
,'some text',
The quote with the most wins, same with the delimiter. If there is no quotechar the delimiter can't be determined this way.
In that case, try the following:
The delimiter should occur the same number of times on each row. However, due to malformed data, it may not. We don't want an all or nothing approach, so we allow for small variations in this number.
For performance reasons, the data is evaluated in chunks, so it can try and evaluate the smallest portion of the data possible, evaluating additional chunks as necessary.
I'm not going to quote the source code here - it's in the Lib directory of every Python installation.
Remember that CSV can also use semicolons instead of commas as delimiters (e. g. in German versions of Excel, CSVs are semicolon-delimited because commas are used as decimal separators in Germany...)
You could show them the results in preview window - similar to the way Excel does it. It's pretty clear when the wrong delimiter is being used in that case. You could then allow them to select a range of delimiters and have the preview update in real time.
Then you could just make a simple guess as to the delimiter to start with (e.g. does a comma or a tab come first).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With