Python: Error tokenizing data. C error: Calling read(nbytes) on source failed with input gzip file

Question

I am using conda python 2.7

python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)

I have the following code to read large gzip files:

df = pd.read_csv(os.path.join(filePath, fileName),
     sep='|', compression = 'gzip', dtype='unicode', error_bad_lines=False)

but when I read the file I get the following error:

pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11

I read all the existing answers, but most of those questions involved errors such as additional columns. I am already handling that with the error_bad_lines=False option.

What are my options here?

Found something interesting when I tried to uncompress the file:

gunzip -k myfile.txt.gz 
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed

arunavkonwar · Accepted Answer

Chances are the path you passed is actually that of a folder instead of the file that needs to be read.

pandas.read_csv can't read folders; it needs the explicit path of a compatible file.
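A quick way to rule this out before calling pandas is to inspect the path first. The following is only a minimal sketch using the standard library; the helper name check_csv_path is hypothetical and not part of pandas.

```python
import os


def check_csv_path(path):
    """Return a short reason why `path` cannot be read, or None if it looks usable.

    Hypothetical helper for illustration only; it checks the path,
    not the contents of the file.
    """
    if os.path.isdir(path):
        return "path is a directory, not a file"
    if not os.path.isfile(path):
        return "path does not exist or is not a regular file"
    if os.path.getsize(path) == 0:
        return "file is empty"
    return None
```

Calling check_csv_path(os.path.join(filePath, fileName)) before pd.read_csv would show whether the joined path points at a folder. It says nothing about whether the gzip data inside the file is intact.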

add-semi-colons · Answer

I didn't really find a Python solution, but I managed to work around the problem with Unix tools:

First I ran zless myfile.txt.gz > uncompressedMyfile.txt, then I used sed to remove the last line, because I could clearly see that the last line was corrupt.

sed '$d' uncompressedMyfile.txt

Then I gzipped the file again with gzip -k uncompressedMyfile.txt

I was able to read the file successfully with the following Python code:

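import os
import pandas as pd
from pandas.parser import CParserError  # exception class named in the traceback above

def read_file(filePath, fileName):
    # Return the parsed DataFrame, or None if the C parser fails.
    try:
        return pd.read_csv(os.path.join(filePath, fileName),
                           sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
    except CParserError:
        print("Something is wrong with the file")
        return None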
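For reference, the same salvage step can be sketched in pure Python with the standard zlib module. This is an illustration under assumptions, not the method used above: the function name salvage_truncated_gzip is hypothetical, it assumes newline-terminated rows, and it only handles a single-member gzip stream.

```python
import zlib


def salvage_truncated_gzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress as much of a truncated .gz file as possible into dst_path.

    The final, possibly incomplete line is dropped unless the stream
    ended cleanly. Returns the number of bytes written.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""  # decompressed bytes after the last newline seen so far
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while not decomp.eof:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # input ran out before the end-of-stream marker
            try:
                pending += decomp.decompress(chunk)
            except zlib.error:
                break  # corrupt data: keep only what was recovered so far
            # Write complete lines only and hold back the unfinished tail.
            cut = pending.rfind(b"\n") + 1
            dst.write(pending[:cut])
            written += cut
            pending = pending[cut:]
        if decomp.eof:
            # The stream ended cleanly, so the tail is real data.
            dst.write(pending)
            written += len(pending)
    return written
```

The output file is plain text, so it would be read with pd.read_csv without compression='gzip'. Whatever was lost to the truncation is still missing; only the rows that could be decompressed are kept.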

Aseem Ahir · Answer

Sometimes this error shows up if the file is already open in another program. Try closing the file and re-running.

Zeugma · Answer

The input gzip file is corrupted. Get a proper copy of this file from the source, or try a repair tool before you pass it along to pandas.
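Corruption of this kind can be detected from Python before pandas ever sees the file, by reading the whole stream with the standard gzip module. This is a minimal sketch; gzip_is_intact is a hypothetical helper name, and it assumes that reading the file once end to end is affordable.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data; only the absence of errors matters.
            while f.read(chunk_size):
                pass
        return True
    except (EOFError, IOError, zlib.error):
        # A truncated stream, a bad header or checksum, or damaged
        # compressed data all end up here. So does a missing file.
        return False
```

A file for which gunzip reports "unexpected end of file" would be expected to return False here, which is a signal to fetch a fresh copy rather than to adjust the read_csv options.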

Tags:

python

pandas
