
Double-quoted elements in CSV can't be read with pandas


I have an input file where every value is stored as a string. It is inside a csv file with each entry inside double quotes.

Example file:

    "column1","column2", "column3", "column4", "column5", "column6"
    "AM", "07", "1", "SD", "SD", "CR"
    "AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
    "AM", "01", "2", "SD", "SD", "SD"

There are only six columns. What options do I need to pass to pandas read_csv to read this correctly?

I currently am trying:

    import pandas as pd
    df = pd.read_csv(file, quotechar='"')

but this gives me the error message: CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14

This evidently means that it is ignoring the '"' and treating every comma as a field separator. However, for line 3, columns 3 through 6 should be strings with commas in them ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").
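The count of 14 can be reproduced outside pandas. The sketch below uses Python's standard csv module as a stand-in (an assumption: it is not the tokenizer pandas uses, but it applies the same rule that a quote only opens a quoted field when it is the first character of that field). Because most fields in line 3 begin with a space, their quotes are treated as ordinary characters and every comma splits.

```python
import csv

# Line 3 of the example file, copied from the question.
line = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"'

# Default dialect: the space after each comma becomes the first character
# of the next field, so the quote that follows is not seen as a quote.
fields = next(csv.reader([line]))

print(len(fields))   # number of fields the reader found
print(fields[:4])    # the first few fields, showing where the splits fall
```

Only the first field, which has no leading space, is unquoted properly; the rest keep their literal quote characters and are cut at each inner comma.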

How do I get pandas.read_csv to parse this correctly?

Thanks.

PopcornKing asked Oct 27 '14 19:10



1 Answer

This will work. It falls back to the Python parser because your separators are not regular: sometimes a bare comma, sometimes a comma followed by a space. If you only had commas, it would use the C parser and be much faster.

    In [1]: import csv

    In [2]: !cat test.csv
    "column1","column2", "column3", "column4", "column5", "column6"
    "AM", "07", "1", "SD", "SD", "CR"
    "AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
    "AM", "01", "2", "SD", "SD", "SD"

    In [3]: pd.read_csv('test.csv',sep=',\s+',quoting=csv.QUOTE_ALL)
    pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
      ParserWarning)
    Out[3]:
          "column1","column2" "column3"   "column4"   "column5"   "column6"
    "AM"                "07"       "1"        "SD"        "SD"        "CR"
    "AM"                "08"   "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
    "AM"                "01"       "2"        "SD"        "SD"        "SD"
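Note that in the output above the values still carry their literal quote characters, and the first two header names are fused because no space follows the comma between them. As a possible alternative, and not the method used in this answer, read_csv also accepts skipinitialspace=True, which skips the blanks after each delimiter so that the quote character is seen at the start of the field. The sketch below assumes the file content is exactly the sample from the question and reads it from an in-memory string; dtype=str is an added assumption to keep values such as "07" as strings.

```python
import io

import pandas as pd

# Sample data copied from the question; a space follows most commas.
raw = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True drops the blanks after each comma, so every field
# starts with the quote character and embedded commas stay inside the field.
# dtype=str keeps all values as strings rather than inferring numbers.
df = pd.read_csv(
    io.StringIO(raw),
    quotechar='"',
    skipinitialspace=True,
    dtype=str,
)

print(df.shape)
print(df.loc[1, "column3"])
```

Since the separator stays a plain comma, this variant does not need a regex separator and should therefore not trigger the fallback to the Python engine.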
Jeff answered Sep 28 '22 21:09