Pandas DataFrame Read Skipping line XXX: expected X fields, saw Y

Tags: python, pandas, csv

I can't figure out what's wrong with the csv file I'm trying to load:

I get error messages such as this: b'Skipping line 2120260: expected 6 fields, saw 8\n'

But when I view the lines, they look ok to me. See below -- (I am going to press enter after each tab \t to make it easier to read).

Line 2,120,260 (failing): ['user_000104\t 2005-09-12T06:25:50Z\t a019a8cf-2601-4a81-b3c3-7b279a873713\t Anne Clark\t 8f8e6bc0-c3c0-4062-875a-773a1de6206f\t Empty Me']

Line 9,000 (not failing): ['user_000001\t 2008-06-15T17:28:31Z\t a3031680-c359-458f-a641-70ccbaec6a74\t Steve Reich\t 2991db42-3b19-4344-a340-605ac3fbd7e9\t Drumming: Part Iv']

If anyone wants to try it out for themselves, download this:

http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html

and run:

    inpFile2 = pd.read_csv(fPath, sep='\t', error_bad_lines=False)

to generate the error. To view the error row, use the following (pass in the file path and the row you are interested in):

    import csv
    import itertools

    def checkRow(path, N):
        with open(path, 'r') as f:
            print("This is the line.")
            print(next(itertools.islice(csv.reader(f), N, None)))
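As a complementary check, here is a minimal sketch that counts raw tab-separated fields per line without any quote handling. The function name find_bad_lines and the three-row sample are made up for illustration; they are not part of the dataset or of pandas.

```python
import io


def find_bad_lines(lines, expected_fields, sep="\t"):
    """Return (line_number, field_count) for every line whose raw field
    count differs from expected_fields. Line numbers start at 1."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        # Plain split: quote characters get no special treatment here.
        count = len(line.rstrip("\n").split(sep))
        if count != expected_fields:
            bad.append((line_number, count))
    return bad


# Hypothetical sample: the second row carries one extra tab.
sample = io.StringIO(
    "user_1\t2005-01-01\tArtist A\tTitle one\n"
    "user_2\t2005-01-02\tArtist B\tTitle\ttwo\n"
    "user_3\t2005-01-03\tArtist C\tTitle three\n"
)

bad = find_bad_lines(sample, expected_fields=4)
print(bad)
```

For a real file, pass the open file object instead of the StringIO. If this raw count finds nothing wrong on a line that pandas complains about, the delimiters are probably fine and quote handling is the more likely culprit.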

user1761806 asked May 10 '17 11:05


2 Answers

OK, I managed to get to the bottom of it.

The solution is to use quoting=csv.QUOTE_NONE as a parameter in the read_csv command (this needs import csv):

    inpFile = pd.read_csv(fPath, sep='\t', error_bad_lines=False, quoting=csv.QUOTE_NONE)

The reason is a double quote in one of the fields, which confuses pandas, so you need to tell it not to treat quote characters specially. With the above change, the file seems to load.
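To see why a stray double quote can change the field count, here is a small sketch using the standard library csv module, which follows the same quoting convention. pandas has its own parser, so the details may differ; the sample rows and the parse helper are invented for illustration.

```python
import csv
import io

# Hypothetical sample: the first title opens with a double quote and the
# third title contains the next one, so the parser sees them as a pair.
sample = (
    'user_1\t2005-01-01\tArtist A\t"Open quote title\n'
    'user_2\t2005-01-02\tArtist B\tPlain title\n'
    'user_3\t2005-01-03\tArtist C\tClosing" title\n'
)


def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


# Default quoting: everything between the two quotes, tabs and newlines
# included, is swallowed into a single field.
default_rows = parse(sample, csv.QUOTE_MINIMAL)

# QUOTE_NONE: quotes are ordinary characters, so every line stays a row.
raw_rows = parse(sample, csv.QUOTE_NONE)

print(len(default_rows), len(raw_rows))
```

With default quoting the three lines collapse into one row; with QUOTE_NONE they come back as three rows of four fields each, with the quote characters left in the data.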

user1761806 answered Sep 18 '22 16:09


If you simply want to "hide" the warnings about bad rows, you can pass warn_bad_lines=False (the default is True). More info here: pandas.pydata.org/pandas-docs
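Note that error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0 in favor of a single on_bad_lines parameter. A minimal sketch of the newer spelling, assuming pandas 1.3 or later and using a made-up three-row sample:

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the second row has one tab-separated field too many.
sample = (
    "user_1\t2005-01-01\tArtist A\tTitle one\n"
    "user_2\t2005-01-02\tArtist B\tTitle\ttwo\n"
    "user_3\t2005-01-03\tArtist C\tTitle three\n"
)

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,             # the sample has no header row
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    on_bad_lines="skip",     # drop malformed rows without a warning
)

print(df.shape)
```

Using on_bad_lines="warn" drops the same rows but reports each one, and on_bad_lines="error" (the default) raises instead. Skipping silently hides data loss, so it is worth checking how many rows were dropped before relying on the result.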

Lorenzo Bassetti answered Sep 17 '22 16:09