Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read_csv expects wrong number of columns, with ragged csv file

I have a csv file that has a few hundred rows and 26 columns, but the last few columns only have a value in a few rows and they are towards the middle or end of the file. When I try to read it in using read_csv() I get the following error. "ValueError: Expecting 23 columns, got 26 in row 64"

I can't see where to explicitly state the number of columns in the file, or how it determines how many columns it thinks the file should have. The dump is below

In [3]:  infile =open(easygui.fileopenbox(),"r") pledge = read_csv(infile,parse_dates='true')   --------------------------------------------------------------------------- ValueError                                Traceback (most recent call last) <ipython-input-3-b35e7a16b389> in <module>()       1 infile =open(easygui.fileopenbox(),"r")       2  ----> 3 pledge = read_csv(infile,parse_dates='true')   C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in read_csv(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)     234         kwds['delimiter'] = sep     235  --> 236     return _read(TextParser, filepath_or_buffer, kwds)     237      238 @Appender(_read_table_doc)  C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in _read(cls, filepath_or_buffer, kwds)     189         return parser     190  --> 191     return parser.get_chunk()     192      193 @Appender(_read_csv_doc)  C:\Python27\lib\site-packages\pandas-0.8.1-py2.7-win32.egg\pandas\io\parsers.pyc in get_chunk(self, rows)     779             msg = ('Expecting %d columns, got %d in row %d' %     780                    (col_len, zip_len, row_num)) --> 781             raise ValueError(msg)     782      783         data = dict((k, v) for k, v in izip(self.columns, zipped_content))  ValueError: Expecting 23 columns, got 26 in row 64 
like image 374
chrisfs Avatar asked Nov 22 '13 20:11

chrisfs


People also ask

Is read_csv faster than Read_excel?

Idea #2: Use CSVs rather than Excel FilesImporting csv files in Python is 100x faster than Excel files. We can now load these files in 0.63 seconds. That's nearly 10 times faster!

What is the default separator in PD read_csv?

The default value of the sep parameter is the comma (,) which means if we don't specify the sep parameter in our read_csv() function, it is understood that our file is using comma as the delimiter.

What does Parse_dates in pandas do?

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.


2 Answers

You can use names parameter. For example, if you have csv file like this:

1,2,1 2,3,4,2,3 1,2,3,3 1,2,3,4,5,6 

And try to read it, you'll receive and error

>>> pd.read_csv(r'D:/Temp/tt.csv') Traceback (most recent call last): ... Expected 5 fields in line 4, saw 6 

But if you pass names parameters, you'll get result:

>>> pd.read_csv(r'D:/Temp/tt.csv', names=list('abcdef'))    a  b  c   d   e   f 0  1  2  1 NaN NaN NaN 1  2  3  4   2   3 NaN 2  1  2  3   3 NaN NaN 3  1  2  3   4   5   6 

Hope it helps.

like image 75
Roman Pekar Avatar answered Oct 09 '22 07:10

Roman Pekar


you can also load the CSV with separator '^', to load the entire string to a column, then use split to break the string into required delimiters. After that, you do a concat to merge with the original dataframe (if needed).

temp=pd.read_csv('test.csv',sep='^',header=None,prefix='X') temp2=temp.X0.str.split(',',expand=True) del temp['X0'] temp=pd.concat([temp,temp2],axis=1) 
like image 41
arjepak Avatar answered Oct 09 '22 09:10

arjepak