Problems reading CSV file with commas and characters in pandas

I am trying to read a CSV file with pandas. The file has a column called Tags that consists of user-provided tags, with values such as -, "", '', 1950's, and 16th-century. Because the tags are user-provided, many special characters have been entered by mistake as well. The problem is that I cannot open the file with pandas read_csv; it fails with the error "CParserError: Error tokenizing data". Can someone help me read this CSV file into pandas?

409

asked Jan 27 '13 18:01

user1992696

1 Answer

Okay. Starting from a badly formatted CSV we can't read:

>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
[...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
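
The error message only names the first offending line. As a rough sketch (not part of the original answer), the standard csv module can list every row whose field count differs from the most common one. The sample text is copied from the file above; the helper name field_counts and the "most common width is the expected width" rule are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Sample data copied from the unquoted.csv shown above.
SAMPLE = (
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)


def field_counts(lines):
    """Return (row_number, field_count) for every CSV row in `lines`."""
    return [(i, len(row)) for i, row in enumerate(csv.reader(lines), start=1)]


counts = field_counts(io.StringIO(SAMPLE))

# Assumption: the most common field count is the "expected" width.
expected = Counter(n for _, n in counts).most_common(1)[0][0]

# Rows whose width differs from the expected one are the suspects.
suspect = [(i, n) for i, n in counts if n != expected]
print(expected, suspect)
```

For a real file, pass an open file object (opened with newline="") instead of the io.StringIO wrapper. Row numbers match line numbers only as long as no quoted field contains a line break.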

We can make a nicer version, taking advantage of the fact that the last three columns are well-behaved, so everything before them must belong to the tags column:

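import csv

# On Python 3, open CSV files in text mode with newline="" so the csv module
# handles line endings itself (the original "rb"/"wb" modes are Python 2 only).
with open("unquoted.csv", newline="") as infile, \
        open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Rejoin everything before the last three fields into one tags field.
        new_row = [','.join(line[:-3])] + line[-3:]
        writer.writerow(new_row)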

which produces

>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345

and then we can read it:

>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
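
If the intermediate file is not wanted, the same rejoining trick can be applied in memory. The following is a minimal sketch under the same assumption as above (the last three fields are always well-behaved); the function name repair_rows and its n_trailing parameter are made up for this example.

```python
import csv
import io

# Sample data copied from the unquoted.csv shown above.
SAMPLE = (
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)


def repair_rows(lines, n_trailing=3):
    """Merge all leading fields into one; keep the last n_trailing unchanged."""
    repaired = []
    for row in csv.reader(lines):
        # Everything before the trailing fields is treated as the tags column.
        repaired.append([",".join(row[:-n_trailing])] + row[-n_trailing:])
    return repaired


rows = repair_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

The resulting list of rows can be handed to pd.DataFrame(rows). Note that every value is then still a string (including " 123" with its leading space), so numeric columns would need an explicit conversion that read_csv otherwise does automatically.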

I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could easily have been impossible to repair (for example, if any of the other columns could also contain commas).
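
What "fixing it at the source" could look like depends on how the file is produced, which the question does not say. As a hedged illustration only: if the export is written with Python's csv.writer, fields containing commas are quoted automatically and the file reads back with a consistent number of columns. The records structure below is hypothetical.

```python
import csv
import io

# Hypothetical source records: (tags, user, rating, count).
records = [
    (["1950's"], "xyz.nl/user_003", "bad", 123),
    (["17th", "red", "flower"], "xyz.nl/user_001", "good", 203),
    ([], "xyz.nl/user_239", "not very", 345),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting: quote only fields that need it
for tags, user, rating, count in records:
    # The joined tags contain commas, so the writer wraps them in quotes.
    writer.writerow([",".join(tags), user, rating, count])

exported = buffer.getvalue()
print(exported)

# Reading the export back gives four fields on every row.
round_trip = list(csv.reader(io.StringIO(exported)))
```

A file written this way should load with a plain pd.read_csv call and no repair step.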

169

answered Oct 14 '22 05:10

DSM

Tags:

python

pandas

csv

special-characters
