I can't figure out what's wrong with the CSV file I'm trying to load.
I get error messages such as this:
b'Skipping line 2120260: expected 6 fields, saw 8\n'
But when I view the lines, they look OK to me. See below (I have added a line break after each tab \t to make it easier to read).
Line 2,120,260 (failing):
['user_000104\t
2005-09-12T06:25:50Z\t
a019a8cf-2601-4a81-b3c3-7b279a873713\t
Anne Clark\t
8f8e6bc0-c3c0-4062-875a-773a1de6206f\t
Empty Me']
Line 9,000 (not failing):
['user_000001\t
2008-06-15T17:28:31Z\t
a3031680-c359-458f-a641-70ccbaec6a74\t
Steve Reich\t
2991db42-3b19-4344-a340-605ac3fbd7e9\t
Drumming: Part Iv']
If anyone wants to try it out for themselves, download this:
http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html
and run:
inpFile2 = pd.read_csv(fPath, sep='\t', error_bad_lines= False)
to generate the error. And:
import csv
import itertools

def checkRow(path, N):
    # Print row N (0-based) as parsed by csv.reader.
    with open(path, 'r') as f:
        print("This is the line.")
        print(next(itertools.islice(csv.reader(f), N, None)))
to view the error row (pass in the file path and the row you are interested in). Make sure you import csv and import itertools.
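One caveat with the helper above: csv.reader applies its own quoting rules and splits on commas by default, so what it prints is not necessarily the raw line. A minimal sketch of a variant that reads the physical line and splits it on tabs only is below. The name raw_fields and the UTF-8 encoding are assumptions for illustration, not part of the original code.

```python
import itertools


def raw_fields(path, n):
    """Return the tab-separated fields of the n-th physical line (0-based).

    No quote handling is applied, so the fields appear exactly as stored.
    The utf-8 encoding is an assumption; adjust it for your file.
    """
    with open(path, "r", encoding="utf-8", newline="") as f:
        line = next(itertools.islice(f, n, None))
    return line.rstrip("\r\n").split("\t")
```

Comparing len(raw_fields(path, n)) with the expected field count shows whether the line itself is malformed or whether the parser is merging lines.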
OK, I managed to get to the bottom of it.
The solution is to use quoting=csv.QUOTE_NONE as a parameter in the read_csv command (this needs import csv):
inpFile = pd.read_csv(fPath, sep='\t', error_bad_lines= False,quoting=csv.QUOTE_NONE)
The reason is that one of the fields contains a double quote, which confuses pandas: it treats the quote as the start of a quoted string and keeps reading past tabs and line breaks. Passing quoting=csv.QUOTE_NONE tells it not to treat quote characters specially. After making the above change, the file seems to load.
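To see the effect on a small scale, here is a sketch using a made-up three-row sample (not taken from the Last.fm file) in which one track name starts with a double quote. With the default quoting, the parser should merge the first two data rows into one; with csv.QUOTE_NONE each physical line stays its own row.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the first track name begins with a double quote
# that is only "closed" on the following line.
sample = (
    "user\tartist\ttrack\n"
    'u1\tA\t"Hello\n'
    'u2\tB\tWorld" again\n'
    "u3\tC\tFine\n"
)

# Default quoting: the quote opens a quoted field that spans the line break.
default_df = pd.read_csv(io.StringIO(sample), sep="\t")

# QUOTE_NONE: quote characters are treated as ordinary text.
raw_df = pd.read_csv(io.StringIO(sample), sep="\t", quoting=csv.QUOTE_NONE)

print(len(default_df), "rows with default quoting")
print(len(raw_df), "rows with quoting=csv.QUOTE_NONE")
```

In the real file the merged line ends up with more tab-separated fields than expected, which is what triggers the "expected 6 fields, saw 8" message.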
In case you simply want to "hide" the warnings for row errors, you can use the parameter warn_bad_lines=False, as opposed to the default value True. More info here: pandas.pydata.org/pandas-docs
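Note for newer pandas versions: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0, in favour of a single on_bad_lines parameter ('error', 'warn' or 'skip'). A minimal sketch with made-up data, assuming pandas 1.3 or later:

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has one field too many.
sample = "a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\t10\n"

# on_bad_lines="skip" drops the malformed row without printing a warning;
# on_bad_lines="warn" would drop it and report it instead.
df = pd.read_csv(io.StringIO(sample), sep="\t", on_bad_lines="skip")

print(df)
```

Keep in mind that skipping bad lines only hides the symptom; if the cause is a stray quote, rows are still being dropped or merged unless quoting is also set.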