I have an input file where every value is stored as a string. It is inside a csv file with each entry inside double quotes.
Example file:
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
There are only six columns. What options do I need to enter to pandas read_csv to read this correctly?
I currently am trying:
import pandas as pd
df = pd.read_csv(file, quotechar='"')
but this gives me the error message: CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14
This apparently means that it is ignoring the '"' and treating every comma as a field separator. However, for line 3, columns 3 through 6 should be strings with commas in them ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").
How do I get pandas.read_csv to parse this correctly?
Thanks.
By "Use 2 quotes", user4035 means that 1 quote should be replaced with 2 quotes. By escaping double quotes with double quotes, you're effectively creating pairs of double quotes (2 double quotes). The final quote you see on the end is to terminate the field.
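The doubling rule can be checked with Python's standard csv module. This is only a small sketch: the sample values are made up for illustration, and it relies on the module's default doublequote behaviour.

```python
import csv
import io

# Hypothetical sample row: one value contains quotes, the other a comma.
row = ['say "hi"', 'a,b']

# Write the row with every field quoted; embedded quotes are doubled.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
line = buf.getvalue()
print(line)  # "say ""hi""","a,b"

# Reading the line back collapses each doubled quote into a single one.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['say "hi"', 'a,b']
```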
Quotation marks appear in CSV files as text qualifiers. This means, they function to wrap together text that should be kept as one value, versus what are distinct values that should be separated out.
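To illustrate the text-qualifier idea, here is a minimal sketch using Python's standard csv module. The two input lines are adapted from the example file in the question, with the spaces after the commas removed.

```python
import csv
import io

# Quoted: the commas inside "1,2,3" belong to the value.
quoted = next(csv.reader(io.StringIO('"AM","08","1,2,3"\n')))
print(quoted)  # ['AM', '08', '1,2,3']

# Unquoted: every comma separates two values.
unquoted = next(csv.reader(io.StringIO('AM,08,1,2,3\n')))
print(unquoted)  # ['AM', '08', '1', '2', '3']
```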
To read a CSV file, call the pandas function read_csv() and pass the file path as input.
This will work. The separators are irregular (a comma, sometimes followed by a space), so the separator has to be given as a regular expression, and pandas falls back to the Python parser. If the file used only commas, pandas would use the C parser and be much faster.
In [1]: import csv

In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"

In [3]: pd.read_csv('test.csv',sep=',\s+',quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  ParserWarning)
Out[3]:
      "column1","column2"  "column3"   "column4"   "column5"   "column6"
"AM"                 "07"        "1"        "SD"        "SD"        "CR"
"AM"                 "08"    "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                 "01"        "2"        "SD"        "SD"        "SD"
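Note that in the output above the quote characters remain part of the values. A possible alternative, offered as a sketch and not as part of the answer above, is read_csv's skipinitialspace=True option. It skips the space after each comma, so the parser sees the opening quote at the start of the field and can stay on the C engine. The dtype=str argument is an added assumption to keep values such as "07" as strings, as the question describes, and the data is inlined through io.StringIO only to make the example self-contained.

```python
import io
import pandas as pd

# Sample data copied from the question, inlined instead of read from test.csv.
data = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True drops the space after each comma, so quoted
# fields are recognised; dtype=str keeps every value as a string.
df = pd.read_csv(io.StringIO(data), quotechar='"', skipinitialspace=True, dtype=str)

print(df.shape)               # (3, 6)
print(df.loc[1, "column3"])   # 1,2,3
```

With a real file, the io.StringIO(data) argument would be replaced by the file path.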