I am using conda with Python 2.7:
python --version
Python 2.7.12 :: Anaconda 2.4.1 (x86_64)
I have the following method to read large gzip files:
df = pd.read_csv(os.path.join(filePath, fileName),
                 sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
but when I read the file I get the following error:
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Segmentation fault: 11
I read all the existing answers, but most of those questions had errors such as additional columns. I was already handling that with the error_bad_lines=False option.
What are my options here?
Found something interesting when I tried to uncompress the file:
gunzip -k myfile.txt.gz
gunzip: myfile.txt.gz: unexpected end of file
gunzip: myfile.txt.gz: uncompress failed
Chances are the path you passed is actually that of a folder instead of the file that needs to be read. pandas.read_csv can't read folders and needs an explicit, compatible file name.
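One way to rule this out before calling pandas is to check what the path actually points at. The sketch below is only an illustration: check_readable_file is a hypothetical helper name, not part of pandas, and it relies solely on os.path.

```python
import os


def check_readable_file(path):
    """Classify `path` so a folder or a missing file is caught early.

    Hypothetical helper for illustration; returns one of
    "directory", "missing" or "file".
    """
    if os.path.isdir(path):
        return "directory"  # read_csv cannot read a folder
    if not os.path.isfile(path):
        return "missing"    # nothing exists at this path
    return "file"
```

If the result is anything other than "file", the value built by os.path.join(filePath, fileName) is worth printing and inspecting first.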
I didn't really find a Python solution, but I managed to find one using Unix tools:
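First I used zless myfile.txt.gz > uncompressedMyfile.txt
Then I used sed to remove the last line, because I could clearly see that the last line was corrupt:
sed -i.bak '$d' uncompressedMyfile.txt
(Without -i, sed only prints the result and leaves the file unchanged; -i.bak edits the file in place and keeps a backup copy.)
Finally I gzipped the file again: gzip -k uncompressedMyfile.txt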
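I was able to successfully read the file with the following Python code:
import os
import pandas as pd
from pandas.parser import CParserError  # module path as shown in the error message above
def read_file(filePath, fileName):  # wrapper name is illustrative
    try:
        df = pd.read_csv(os.path.join(filePath, fileName),
                         sep='|', compression='gzip', dtype='unicode', error_bad_lines=False)
    except CParserError:
        print("Something is wrong with the file")
        return None  # df was never assigned, so there is nothing to return
    return df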
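For anyone who would rather stay in Python, the shell steps above can be approximated with the standard zlib module. This is only a sketch under a few assumptions: it targets Python 3 (it uses the eof attribute of decompression objects, which Python 2.7 lacks), it assumes a single-member gzip file, and salvage_gzip_lines is a hypothetical name. It decompresses as much as it can and keeps only complete lines, which has the same effect as dropping the corrupt last line with sed.

```python
import zlib


def salvage_gzip_lines(gz_path, out_path, chunk_size=1 << 20):
    """Recover the complete lines from a possibly truncated gzip file.

    Writes the recovered text to out_path. Returns True if the gzip
    stream ended cleanly, False if it was truncated or damaged.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""  # bytes after the last newline written so far
    with open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            try:
                data = decomp.decompress(chunk)
            except zlib.error:
                break  # damaged data: keep what was recovered so far
            pending += data
            # Write everything up to the last newline, hold back the rest.
            head, sep, tail = pending.rpartition(b"\n")
            dst.write(head + sep)
            pending = tail
            if decomp.eof:
                break
        clean = decomp.eof
        if clean:
            # Intact file: the final line may simply lack a newline.
            dst.write(pending)
    return clean
```

The output is plain text, so it can be passed to pd.read_csv without compression='gzip', or gzipped again as in the shell version.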
Sometimes the error shows up if you already have the file open in another program. Try closing the file and re-running.
The input zip file is corrupted. Get a proper copy of the file from the source, or try a zip repair tool before you pass it along to pandas.
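A quick way to confirm corruption before involving pandas is to read the archive to the end with the standard gzip module. The sketch below assumes Python 3 exception behaviour (a truncated stream raises EOFError, a bad header or checksum raises OSError); gzip_is_intact is a hypothetical helper name.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip file decompresses without error."""
    try:
        with gzip.open(path, "rb") as f:
            # Read in chunks so large files are never held in memory.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # Truncated stream, bad header or checksum, or damaged data.
        # A missing file also lands here, since it raises an OSError.
        return False
    return True
```

A False result for a file that exists matches the "unexpected end of file" message from gunzip, and it means the copy itself needs to be replaced or repaired.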