
Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input gzip file

Tags: python, pandas

I am using conda Python 2.7:

python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)

I have the following call to read large gzip files:

df = pd.read_csv(os.path.join(filePath, fileName),
     sep='|', compression = 'gzip', dtype='unicode', error_bad_lines=False)

but when I read the file I get the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11

I read all the existing answers, but most of those questions involved errors such as additional columns. I was already handling that with the error_bad_lines=False option.

What are my options here?

Found something interesting when I tried to uncompress the file:

gunzip -k myfile.txt.gz 
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed
add-semi-colons asked Nov 27 '16 23:11


4 Answers

Chances are the path you passed points to a folder rather than to the file that needs to be read.

pandas.read_csv cannot read folders; it needs an explicit path to a compatible file.
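To rule this out before calling read_csv, a quick check along these lines may help. This is only a sketch: check_csv_path is a hypothetical helper name, not part of pandas, and the messages it returns are made up for illustration.

```python
import os


def check_csv_path(path):
    """Return a short diagnosis of why `path` may be unreadable, or None if it looks fine.

    Hypothetical helper for illustration; it only inspects the path,
    it does not open or parse the file.
    """
    if not os.path.exists(path):
        return "path does not exist"
    if os.path.isdir(path):
        return "path is a directory, not a file"
    if os.path.getsize(path) == 0:
        return "file is empty"
    return None
```

If the function returns None, the path at least points to a non-empty regular file, so the problem is more likely in the file's contents.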

arunavkonwar answered Oct 12 '22 11:10


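I didn't really find a Python solution, but I managed to work around the problem with Unix tools.

First I decompressed the file with zless myfile.txt.gz > uncompressedMyfile.txt. Then I used sed to remove the last line, because I could clearly see that the last line was corrupt. Note that sed '$d' on its own only prints the result; the -i flag edits the file in place (GNU sed syntax; BSD/macOS sed needs -i ''):

sed -i '$d' uncompressedMyfile.txt

Then I gzipped the file again with gzip -k uncompressedMyfile.txt.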

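I was able to successfully read the file with the following Python code:

import os
import pandas as pd
from pandas.parser import CParserError  # module path from the traceback above; newer pandas uses pandas.errors.ParserError
# read_file is a placeholder name for the enclosing function
def read_file(filePath, fileName):
    try:
        return pd.read_csv(os.path.join(filePath, fileName),
                           sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
    except CParserError:
        print("Something is wrong with the file")
        return None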
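For what it's worth, the same salvage can probably be done in Python with the standard zlib module. The sketch below is not what I actually ran; salvage_truncated_gzip is a hypothetical name, and it assumes a single-member gzip file whose damage is a truncated tail. It decompresses as far as the data allows and keeps only complete lines, which has roughly the same effect as dropping the corrupt last line with sed.

```python
import zlib


def salvage_truncated_gzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Recover the readable prefix of a damaged gzip file.

    Writes only complete lines to dst_path and returns the number of
    bytes written. Assumes a single-member gzip file; anything after
    the first member is ignored.
    """
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    written = 0
    pending = b""  # decompressed bytes not yet known to end a line
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            try:
                pending += decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt data: keep what was recovered so far
            cut = pending.rfind(b"\n")
            if cut != -1:
                dst.write(pending[:cut + 1])
                written += cut + 1
                pending = pending[cut + 1:]
    # Whatever remains in `pending` is a partial last line and is dropped.
    return written
```

The recovered file is plain text, so it can be passed to read_csv without the compression argument. One caveat: an intact file whose last line has no trailing newline would also lose that line.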
add-semi-colons answered Oct 12 '22 11:10


Sometimes this error shows up if you already have the file open. Try closing the file and re-running.

Aseem Ahir answered Oct 12 '22 12:10


The input compressed file is corrupted. Get a proper copy of this file from the source, or try a gzip repair tool, before you pass it along to pandas.
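If you want to confirm the corruption from Python rather than with gunzip, something like the following might do. It is a minimal sketch, assuming Python 3 (where a truncated file raises EOFError and a bad header or CRC raises an OSError subclass); gzip_is_intact is a hypothetical name, and Python 2.7 may raise different exceptions.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip file decompresses cleanly, else False.

    Reads the file to the end in chunks so that large files are not
    held in memory. Exception types assume Python 3.
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (EOFError, IOError, zlib.error):
        return False
    return True
```

Running this before read_csv lets you tell a damaged archive apart from a genuine parsing problem such as extra columns.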

Zeugma answered Oct 12 '22 11:10