I am trying to read a CSV file using pandas. The file has a column called Tags, which consists of user-provided tags such as -, "", '', 1950's, and 16th-century. Since the tags are user-provided, many special characters have been entered by mistake as well. The problem is that I cannot open the file with pandas read_csv. It raises the error: CParserError: Error tokenizing data. Can someone help me read the CSV file into pandas?
The pandas read_csv() function reads a CSV file and loads each record as a row of a DataFrame. In a similar way, pandas supports reading and writing DataFrame contents from/to SQL databases such as MySQL and PostgreSQL.
Handling commas in the data while writing to a CSV: data fields that contain a comma can simply be wrapped in double quotes. Fields containing line breaks (CRLF), double quotes, or commas should be enclosed in double quotes.
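As a rough illustration of that quoting rule, here is a minimal sketch using Python's standard csv module. The sample rows are made up for the example; the point is only that csv.writer, with its default settings, quotes a field when it contains a comma or a double quote, and that csv.reader recovers the original values.

```python
import csv
import io

# Hypothetical sample rows; the first field stands in for a user-provided tag.
rows = [
    ["1950's", "xyz.nl/user_003", "bad", "123"],
    ["17th,red,flower", "xyz.nl/user_001", "good", "203"],
    ['say "hi"', "xyz.nl/user_239", "not very", "345"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting is csv.QUOTE_MINIMAL
writer.writerows(rows)
written = buffer.getvalue()

# Reading the text back recovers the original fields, commas and quotes included.
round_tripped = list(csv.reader(io.StringIO(written)))
print(written)
```

Only the fields that need it are quoted: the apostrophe in 1950's is left alone, while the embedded double quotes are escaped by doubling them.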
Using the "From Text" feature in Excel: select the CSV file that has the data clustered into one column. Select Delimited, then make sure the File Origin is Unicode UTF-8. Select Comma (this is Affinity's default list separator). The preview will show the columns being separated.
Okay. Starting from a badly formatted CSV we can't read:
>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
[...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
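The message says line 2 has six fields where four were expected. To check which lines of a larger file have this problem, one option is to count fields per row with the standard csv module. This is only a sketch: find_ragged_lines is a hypothetical helper name, the sample text mirrors the file above, and the row counter matches physical line numbers only when no quoted field spans several lines.

```python
import csv
import io

def find_ragged_lines(lines, expected_fields):
    """Return (row_number, field_count) for rows whose width differs from expected_fields."""
    ragged = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected_fields:
            ragged.append((row_number, len(row)))
    return ragged

# In-memory stand-in for unquoted.csv, so the sketch runs without the file.
sample = io.StringIO(
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)
ragged = find_ragged_lines(sample, expected_fields=4)
print(ragged)  # [(2, 6)]
```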
We can make a nicer version of the file, taking advantage of the fact that the last three columns are well-behaved:
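import csv

# Python 3: open in text mode; newline="" lets the csv module handle line endings
with open("unquoted.csv", newline="") as infile, open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Rejoin everything before the last three fields into a single tags field
        new_row = [','.join(line[:-3])] + line[-3:]
        writer.writerow(new_row)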
which produces
>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345
and then we can read it:
>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could very easily have been impossible to repair.
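If writing an intermediate file is undesirable, the same idea can be sketched by splitting each line from the right. This is a variation on the approach above, not the code from the answer: split_from_right is a hypothetical helper, it assumes the last three fields never contain commas or quotes, and unlike csv.reader it keeps any quote characters in the first field literally.

```python
def split_from_right(line, trailing_fields=3):
    """Split a line into [free_text, *trailing] using only the last trailing_fields commas."""
    parts = line.rstrip("\r\n").rsplit(",", trailing_fields)
    if len(parts) != trailing_fields + 1:
        raise ValueError(f"too few fields in line: {line!r}")
    return parts

# Lines copied from the unquoted.csv example above.
raw_lines = [
    "1950's,xyz.nl/user_003,bad, 123\n",
    "17th,red,flower,xyz.nl/user_001,good,203\n",
]
rows = [split_from_right(line) for line in raw_lines]
print(rows[1])  # ['17th,red,flower', 'xyz.nl/user_001', 'good', '203']
```

The resulting list of rows can then be passed to pandas.DataFrame(rows) to build the frame directly. Everything stays a string that way, so numeric columns would still need converting.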