
Pandas read_csv always crashes on a small file

I am trying to import a rather small CSV file (217 rows, 87 columns, ~15 kB) for analysis in Python using Pandas. The file is rather poorly structured, but I would still like to import it, since it is the raw data, which I do not want to manipulate manually outside Python (e.g. with Excel). Unfortunately the import always leads to a crash: "The kernel appears to have died. It will restart automatically".

https://www.wakari.io/sharing/bundle/uniquely/ReadCSV

I did some research which indicated possible crashes with read_csv, but always for really large files, so I do not understand the problem. The crash happens both with my local installation (Anaconda 64-bit, IPython (Py 2.7) Notebook) and on Wakari.

Can anybody help me? Would be really appreciated. Thanks a lot!

Code:

# I have a somewhat ugly, illustrative csv file, but it is not too big: 217 rows, 87 columns.
# File can be downloaded at http://www.win2day.at/download/lo_1986.csv

# In[1]:

file_csv = 'lo_1986.csv'
with open(file_csv, mode="r") as f:
    for x, line in enumerate(f):
        print x, ": ", line


# Now I'd like to import this csv into Python using Pandas - but this always leads to a crash:
# "The kernel appears to have died. It will restart automatically."

# In[ ]:

import pandas as pd
pd.read_csv(file_csv, delimiter=';')

# What am I doing wrong?
asked Aug 18 '14 by uniquely


2 Answers

It is because of an invalid character (e.g. 0xe0) in the file.

If you add an encoding parameter to the read_csv() call, you will see this stack trace instead of a segfault:

>>> df = pandas.read_csv("/tmp/lo_1986.csv", delimiter=";", encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 205, in _read
    return parser.read()
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
  File "parser.pyx", line 1051, in pandas.parser.TextReader._string_convert (pandas/parser.c:10905)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas/parser.c:15657)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data

You can do some preprocessing to remove these characters before asking pandas to read in the file
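A minimal sketch of such preprocessing (the sample bytes below are a stand-in, not the real file contents): decode the raw bytes permissively so that invalid sequences cannot raise, then hand pandas the cleaned text.

```python
# Stand-in bytes for the problematic file: 0xe0 is the byte from the
# traceback (it is not valid UTF-8 on its own).
raw = b"col1;col2\n\xe0bc;123\n"

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
# so the decode never fails the way the C parser does.
cleaned = raw.decode("utf-8", errors="replace")
print(repr(cleaned))

# pandas can then read the cleaned text from an in-memory buffer:
#   import io, pandas as pd
#   df = pd.read_csv(io.StringIO(cleaned), delimiter=";")
```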

Attached is a picture highlighting the invalid characters in the file.
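Since the screenshot is not reproduced here, a small scan of the raw bytes serves the same purpose (the sample below is a stand-in, not the actual file): report the offset of every byte outside the ASCII range.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte >= 0x80."""
    return [(i, b) for i, b in enumerate(bytearray(data)) if b >= 0x80]

# Stand-in sample containing one offending byte (0xe0 = 224)
sample = b"Zahl;Quote\n\xe0 123;456\n"
print(find_non_ascii(sample))  # → [(11, 224)]
```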

answered Sep 24 '22 by Anthony Kong


Thanks a lot for your remarks. I could not agree more with the comment that this is indeed a very messed-up CSV. But unfortunately that is the way the Austrian State Lottery shares their information on drawn numbers and payout quotes.

I continued playing around, also looking at the special characters. In the end the solution was, at least for me, surprisingly simple:

pd.read_csv(file_csv, delimiter=';', encoding='latin-1', engine='python')

The added encoding helps to display the special characters correctly, but the game changer was the engine parameter. To be honest, I do not understand why, but now it works.
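One likely reason the encoding change helps (a sketch, not taken from the original thread): latin-1 assigns a character to every one of the 256 possible byte values, so unlike UTF-8 it can never fail to decode. The byte 0xe0 from the traceback illustrates this:

```python
bad_byte = b"\xe0"  # the byte the UnicodeDecodeError pointed at

# As a lone byte this is invalid UTF-8 (0xe0 announces a 3-byte sequence)...
try:
    bad_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ...but latin-1 maps it directly to a character, so it always decodes.
latin1_text = bad_byte.decode("latin-1")
print(utf8_ok, latin1_text)  # → False à
```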

Thanks again!

answered Sep 21 '22 by uniquely