A lot of questions have already been asked about this topic on SO (and many other sites). Among the numerous answers, none has been really helpful to me so far. If I missed a useful one, please let me know.
I simply would like to read a CSV file with pandas into a dataframe. Sounds like a simple task.
My file Test.csv
1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
My code:
import pandas as pd
df = pd.read_csv('Test.csv',header=None)
My error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6
My guess about the issue is that Pandas looks at the first line and expects the same number of tokens in the following rows. If that is not the case, it stops with an error.
In the numerous answers, the suggestions for using options are, e.g.:
error_bad_lines=False
or header=None
or skiprows=3
and more non-helpful suggestions.
However, I don't want to ignore or skip any lines. And I don't know in advance how many columns and rows the data file has.
So it basically boils down to how to find the maximum number of columns in the datafile. Is this the way to go? I hoped that there was an easy way to simply read a CSV file which does not have the maximum column number in the first line. Thank you for any hints. I'm using Python 3.6.3, Pandas 0.24.1 on Win7.
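Something like the following two-pass sketch is what I have in mind (assuming plain commas without quoting): first count the widest row with the csv module, then tell read_csv to expect that many columns.

import csv
import pandas as pd

# First pass: find the maximum number of fields in any row
with open('Test.csv') as f:
    max_cols = max(len(row) for row in csv.reader(f))

# Second pass: give pandas that many column names up front,
# so shorter rows are padded with NaN instead of raising an error
df = pd.read_csv('Test.csv', header=None, names=range(max_cols))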
The "Error tokenizing data" arises when a row contains more separators (e.g. commas ',') than expected, i.e. more fields in the offending row than defined by the header. So you need to either remove the additional field or remove the extra separator if it's there by mistake.
From the pandas documentation:

ParserError: Exception that is raised by an error encountered in parsing file contents. This is a generic error raised for errors encountered when functions like read_csv or read_html are parsing contents of a file. See also read_csv: Read CSV (comma-separated) file into a DataFrame.

If error_bad_lines is False, and warn_bad_lines is True, a warning for each "bad line" will be output. (Only valid with the C parser.)
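For illustration, with the pandas version used in the question (0.24.1) those options look like this; note that they drop the offending long rows rather than keep them, which is exactly what the question wants to avoid (in pandas >= 1.3 these flags were replaced by on_bad_lines):

import pandas as pd

# Emits a warning and skips the rows with more than 5 fields
# (lines 2 and 4 of Test.csv) instead of raising ParserError
df = pd.read_csv('Test.csv', header=None,
                 error_bad_lines=False, warn_bad_lines=True)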
Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.
So, the way it works for me for the time being is adapted from @ALollz' compact solution (https://stackoverflow.com/a/55129746/7295599)
### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens
import pandas as pd
df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code
df contains the empty string '' for the missing entries at the beginning and in the middle, and None for the missing tokens at the end:
   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None
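If the mix of '' and None is inconvenient, it can be normalized in one step; a minimal sketch (my addition, not part of @ALollz' answer):

# None is treated as missing in object columns, so fillna
# turns every None into an empty string as well
df = df.fillna('')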
If you write this again to a file via:
df.to_csv("Test.tab",sep="\t",header=False,index=False)
1 2 3 4 5
1 2 3 4 5 6
3 4 5
1 2 3 4 5 6 7
2 4
the None entries will be converted to the empty string '' and everything is fine.
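As a quick round-trip check (my sketch): the padded file can now be read back directly, since every row has the same number of tab-separated fields; the empty fields come back as NaN.

df2 = pd.read_csv('Test.tab', sep='\t', header=None)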
The next level would be to account for data strings in quotes which contain the separator, but that's another topic.
1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
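For that case the naive str.split(',') would break up "Hello, World!", but the standard csv module already understands quoting, so the splitting step can be swapped out. A sketch (the file name Test_quoted.csv is made up for illustration):

import csv
import pandas as pd

# csv.reader respects double quotes, so "Hello, World!" stays one field
with open('Test_quoted.csv') as f:
    rows = list(csv.reader(f))

# pad the rows to the same width, as before
max_cols = max(len(r) for r in rows)
df = pd.DataFrame([r + [''] * (max_cols - len(r)) for r in rows])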
Read the csv using the tolerant Python csv module, and fix the loaded data prior to handing it off to pandas, which would otherwise fail on the malformed csv data regardless of which csv engine pandas uses.
import pandas as pd
import csv

not_csv = """1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
"""

with open('not_a.csv', 'w') as csvfile:
    csvfile.write(not_csv)

d = []
with open('not_a.csv') as csvfile:
    areader = csv.reader(csvfile)
    # first pass: find the widest row
    max_elems = 0
    for row in areader:
        if max_elems < len(row):
            max_elems = len(row)
    # rewind and read again; csv.reader pulls lazily from the file object
    csvfile.seek(0)
    for i, row in enumerate(areader):
        # fix my csv by padding the rows
        d.append(row + ["" for x in range(max_elems - len(row))])

df = pd.DataFrame(d)
print(df)
# the default engine
# provides "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6 "
#df = pd.read_csv('Test.csv',header=None, engine='c')
# the python csv engine
# provides "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7 "
#df = pd.read_csv('Test.csv',header=None, engine='python')
Alternatively, preprocess the file outside of Python if you're concerned about adding too much extra Python code:
Richs-MBP:tmp randrews$ cat test.csv
1,2,3
1,
2
1,2,
,,,
Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
1,2,3,,
1,,,,
2,,,,
1,2,,,
,,,,
I have a different take on the solution. Let pandas take care of creating the table and replacing the None values, and let us take care of writing a proper tokenizer.
def tokenize(line):
    # positions of all double quotes in the line
    idx = [x for x, v in enumerate(line) if v == '\"']
    if len(idx) % 2 != 0:
        # unbalanced quote: ignore the last, unmatched one
        idx = idx[:-1]
    memory = {}
    for i in range(0, len(idx), 2):
        val = line[idx[i]:idx[i + 1] + 1]
        # placeholder of the same length as the quoted field (for i < 10),
        # so it contains no comma and later quote positions stay valid
        key = "_" * (len(val) - 1) + "{0}".format(i)
        memory[key] = val
        line = line.replace(memory[key], key, 1)
    # split on the remaining commas, then swap the placeholders back in
    return [memory.get(token, token) for token in line.split(",")]
print (tokenize("1,2,3,4,5"))
print (tokenize(",,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello,,,, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))
Output
['1', '2', '3', '4', '5']
['', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello,,,, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']
with open("test1.csv", "r") as fp:
lines = fp.readlines()
lines = list(map(lambda x: tokenize(x.strip()), lines))
df = pd.DataFrame(lines).replace(np.nan, '')
Now we can tweak the tokenize function as per our needs.
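For instance, a minimal tweak (my sketch, not part of the original answer) that also strips the surrounding quotes from quoted fields:

def tokenize_unquoted(line):
    # reuse tokenize(), then drop the surrounding double quotes, if any
    return [t[1:-1] if len(t) >= 2 and t[0] == t[-1] == '"' else t
            for t in tokenize(line)]

print(tokenize_unquoted(',,3,"Hello, World!",5,6'))
# ['', '', '3', 'Hello, World!', '5', '6']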
In my case:
1. I opened the *.csv in Excel
2. I saved the *.csv as CSV (comma-delimited)
3. I loaded the file in python via:
import pandas as pd
df = pd.read_csv('yourcsvfile.csv', sep=',')
Hope it helps!