
Pandas ParserError EOF character when reading multiple csv files to HDF5

Using Python3, Pandas 0.12

I'm trying to write multiple csv files (total size 7.9 GB) to an HDF5 store for later processing. Each csv file contains around a million rows and 15 columns; the data types are mostly strings, with some floats. However, when I try to read the csv files I get the following error:

Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done

Edit:

I managed to find a file that produces this problem. I think the parser is reading an EOF character, but I have no idea how to work around it. Given the size of the combined files, checking every character in every string seems too cumbersome (and even then I would not be sure what to do). As far as I can tell, there are no strange characters in the csv files that could raise the error. I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
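A rough way to narrow the search without inspecting every character is to flag lines that contain an odd number of quote characters, since "EOF inside string" means the parser opened a quoted field and never found the closing quote. The sketch below is only a heuristic, not part of the original post: the helper name `lines_with_unbalanced_quotes` and the sample file are made up for illustration, and a legitimately quoted field that spans several lines would also be flagged.

```python
import os
import tempfile


def lines_with_unbalanced_quotes(path, quotechar='"', encoding='utf-8'):
    """Return (line_number, text) for every line with an odd count of quotechar."""
    suspects = []
    # newline='' keeps the original line endings so nothing is translated
    with open(path, encoding=encoding, newline='') as handle:
        for number, line in enumerate(handle, start=1):
            if line.count(quotechar) % 2 == 1:
                suspects.append((number, line.rstrip('\r\n')))
    return suspects


# Small self-contained demonstration on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.csv')
    with open(sample, 'w', encoding='utf-8', newline='') as out:
        out.write('id,comment\n1,"ok"\n2,"broken\n3,fine\n')
    suspects = lines_with_unbalanced_quotes(sample)

print(suspects)  # [(3, '2,"broken')]
```

The file is read one line at a time, so memory use stays flat even for multi-gigabyte inputs. The line number reported in the pandas error is where the unterminated string starts, so it is a good first place to compare against the output.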

My code is the following:

# -*- coding: utf-8 -*-

import pandas as pd
import os
from glob import glob


def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files


def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)

    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)

    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()

Edit

If I open the CSV file that raises the CParserError and manually delete all rows after the line that causes the problem, the file is read properly, even though all I am deleting are blank rows. The weird thing is that when I manually correct the erroneous csv files, each one loads into the store fine on its own. But when I go back to reading a list of multiple files, the same files still raise errors.

Matthijs asked Aug 02 '13 11:08


2 Answers

I had the same problem; after adding these two parameters to read_csv, it went away. (quoting=3 is the numeric value of csv.QUOTE_NONE.)

read_csv(..., quoting=3, error_bad_lines=False)

weefwefwqg3 answered Sep 20 '22 22:09


I had a similar problem. The line reported in the 'EOF inside string' message had a string that contained a single quote mark. Adding the option quoting=csv.QUOTE_NONE fixed my problem.

For example:

import csv
import pandas as pd

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
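To see why QUOTE_NONE helps, here is a small sketch using only the standard library csv module rather than pandas. pandas has its own tokenizer, so this illustrates the quoting rule, not pandas internals; the sample data and the helper names `parse` and `fails_strict` are invented for the illustration. A lone quote mark opens a quoted field that never closes, so the parser swallows the rest of the input; with quoting disabled, the quote mark is treated as an ordinary character.

```python
import csv
import io

# Hypothetical data: the second record contains a lone, unbalanced quote mark
SAMPLE = 'id,comment\n1,"unterminated\n2,fine\n'


def parse(text, **fmt):
    """Parse text with csv.reader, forwarding formatting options such as quoting."""
    return list(csv.reader(io.StringIO(text), **fmt))


def fails_strict(text):
    """Return True if the strict stdlib parser rejects the text."""
    try:
        parse(text, strict=True)
    except csv.Error:
        return True
    return False


# Default quoting: everything after the lone quote ends up in a single field
rows_default = parse(SAMPLE)

# QUOTE_NONE: the quote mark is kept as plain data and each line is its own row
rows_no_quotes = parse(SAMPLE, quoting=csv.QUOTE_NONE)

print(fails_strict(SAMPLE))   # True
print(len(rows_default))      # 2
print(rows_no_quotes)         # [['id', 'comment'], ['1', '"unterminated'], ['2', 'fine']]
```

The trade-off is that with quoting disabled, fields that were deliberately quoted keep their quote characters, and delimiters inside quoted fields split the field. It is worth spot-checking a few rows after loading.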

Selah answered Sep 21 '22 22:09