I am trying to import a rather small CSV file (217 rows, 87 columns, ~15 kB) for analysis in Python using pandas. The file is rather poorly structured, but I would still like to import it, since it is the raw data, which I do not want to manipulate manually outside Python (e.g. with Excel). Unfortunately it always leads to a crash: "The kernel appears to have died. It will restart automatically."
https://www.wakari.io/sharing/bundle/uniquely/ReadCSV
I did some research, which indicated possible crashes with read_csv, but always for really large files, so I do not understand the problem. The crash happens both with my local installation (Anaconda 64-bit, IPython (Py 2.7) Notebook) and on Wakari.
Can anybody help me? Would be really appreciated. Thanks a lot!
Code:
# I have a somewhat ugly, illustrative csv file, but it is not too big: 217 rows, 87 columns.
# File can be downloaded at http://www.win2day.at/download/lo_1986.csv
# In[1]:
file_csv = 'lo_1986.csv'
f = open(file_csv, mode="r")
x = 0
for line in f:
print x, ": ", line
x = x + 1
f.close()
# Now I'd like to import this csv into Python using pandas - but this always leads to a crash:
# "The kernel appears to have died. It will restart automatically."
# In[ ]:
import pandas as pd
pd.read_csv(file_csv, delimiter=';')
# What am I doing wrong?
It is because of an invalid character (e.g. 0xe0) in the file.
If you add an encoding parameter to the read_csv() call, you will see this stack trace instead of a segfault:
>>> df = pandas.read_csv("/tmp/lo_1986.csv", delimiter=";", encoding="utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 205, in _read
return parser.read()
File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
ret = self._engine.read(nrows)
File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
data = self._reader.read(nrows)
File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:6964)
File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
File "parser.pyx", line 1051, in pandas.parser.TextReader._string_convert (pandas/parser.c:10905)
File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas/parser.c:15657)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data
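To find out exactly where such bytes live before involving pandas, you can scan the raw file yourself. This is a sketch (in Python 3 terms), with a hypothetical helper name `find_non_utf8_lines` introduced for illustration:

```python
# Hypothetical helper: scan a file line by line and report which lines
# contain bytes that are not valid UTF-8.
def find_non_utf8_lines(path):
    bad = []
    with open(path, "rb") as f:  # read raw bytes, no decoding
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # record line number, byte offset, and the offending byte value
                bad.append((lineno, exc.start, raw[exc.start]))
    return bad
```

For the file above this should point you straight at the lines containing 0xe0.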
You can do some preprocessing to remove these characters before asking pandas to read in the file.
I have attached a picture to highlight the invalid characters in the file.
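One way to do that preprocessing, as a sketch in Python 3 terms: decode the raw bytes with errors="replace" so each invalid byte becomes U+FFFD, then hand the cleaned text to pandas through a StringIO buffer. The wrapper name `read_csv_lenient` is made up for illustration:

```python
import io

import pandas as pd


def read_csv_lenient(path, **kwargs):
    # Decode leniently: undecodable bytes become the U+FFFD replacement
    # character instead of raising UnicodeDecodeError or crashing the parser.
    with open(path, "rb") as f:
        text = f.read().decode("utf-8", errors="replace")
    # pandas can read from any file-like object, including an in-memory buffer.
    return pd.read_csv(io.StringIO(text), **kwargs)
```

This keeps the original file untouched on disk, which matters when it is raw data you do not want to edit by hand.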
Thanks a lot for your remarks. I could not agree more with the comment that this is indeed a very messed-up CSV, but unfortunately that is the way the Austrian State Lottery shares its information on drawn numbers and payout quotes.
I continued playing around, also looking at the special characters. In the end the solution was, at least for me, surprisingly simple:
pd.read_csv(file_csv, delimiter=';', encoding='latin-1', engine='python')
The added encoding helps to display the special characters correctly, but the game changer was the engine parameter. To be honest, I do not understand why, but now it works.
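A likely reason, stated as an assumption rather than a definitive explanation: latin-1 assigns a character to every one of the 256 possible byte values, so decoding with it can never fail, and the pure-Python parser is more forgiving than the C parser that was segfaulting. The first point is easy to check:

```python
# latin-1 maps all 256 byte values, so any byte sequence decodes successfully.
raw = b"\xe0"  # the byte that broke the UTF-8 decode above
print(raw.decode("latin-1"))  # -> 'à' (U+00E0)

# The same byte is invalid as the start of a UTF-8 sequence on its own:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 decode failed:", exc.reason)
```

So with encoding='latin-1', pandas never hits an undecodable byte, whether or not 'à' is actually the character the lottery intended.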
Thanks again!