'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Tags: python, python-3.x, pandas, data-import

I am trying to read a CSV file into a DataFrame called df1, but it fails:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")

df1.head()

The code produces a long traceback; this is the most relevant part:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

asked Sep 01 '17 12:09

Tuyen

2 Answers

The data is indeed not encoded as UTF-8; everything is ASCII except for that single 0x92 byte:

b'Korea, Dem. People\x92s Rep.'

Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’:

df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
                  sep=";", encoding='cp1252')
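To see why the codec matters without downloading the file, here is a minimal sketch that decodes only the sample bytes quoted above. The variable names (`raw`, `utf8_error`, `text`) are illustrative and not part of the original answer.

```python
# Sample bytes quoted above: plain ASCII except for the single 0x92 byte.
raw = b"Korea, Dem. People\x92s Rep."

# Decoding as UTF-8 fails, because 0x92 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding as cp1252 maps 0x92 to U+2019 (right single quotation mark).
text = raw.decode("cp1252")

print(repr(utf8_error))
print(text)
```

The same `encoding='cp1252'` argument is what tells `pd.read_csv()` to apply this mapping to the whole file.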

Demo:

>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
...                   sep=";", encoding='cp1252')
>>> df1.head()
                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  \
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..

   2010  2011  2012  2013  Unnamed: 15  2014  2015
0  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3    ..    ..    ..    ..          NaN    ..    ..
4    ..    ..    ..    ..          NaN    ..    ..

Note, however, that pandas also seems to take the HTTP headers at face value and produces Mojibake when you load the data from a URL. When I save the data to disk and then load it with pd.read_csv(), it is decoded correctly, but loading from the URL produces re-coded data:

>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'

This is a known bug in pandas. You can work around it by opening the URL with urllib.request and passing the response object to pd.read_csv() instead:

>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
...     df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
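The garbled string shown earlier in this answer can be reproduced offline. The following is a minimal sketch, assuming the re-coding amounts to UTF-8 bytes being read back as cp1252; it does not claim to show what pandas does internally, and the variable names are illustrative.

```python
# The correctly decoded value, with U+2019 as the apostrophe.
original = "Korea, Dem. People\u2019s Rep."

# Mojibake: the text is serialized as UTF-8 but read back as cp1252,
# so the three UTF-8 bytes of U+2019 turn into three separate characters.
garbled = original.encode("utf-8").decode("cp1252")

# Repair: undo the wrong decoding step, then decode with the right codec.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

This kind of repair only works when every garbled character can be encoded back to cp1252, so reading the raw bytes with the correct codec in the first place remains the more robust option.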

answered Oct 24 '22 08:10

Martijn Pieters

It turned out that the CSV was created on macOS and was being parsed on a Windows machine, which caused the UnicodeDecodeError. To get rid of the error, try passing encoding='mac_roman' to the read_csv function of pandas:

import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()

Output:

                   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  Unnamed: 15  2014  2015
0     Afghanistan  55.1  55.5  55.9  56.2  56.6  57.0  57.4  57.8  58.2  58.6  59.0  59.3  59.7  60.0          NaN  60.4  60.7
1         Albania  74.3  74.7  75.2  75.5  75.8  76.1  76.3  76.5  76.7  76.8  77.0  77.2  77.4  77.6          NaN  77.8  78.0
2         Algeria  70.2  70.6  71.0  71.4  71.8  72.2  72.6  72.9  73.2  73.5  73.8  74.1  74.3  74.6          NaN  74.8  75.0
3  American Samoa    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..          NaN    ..    ..
4         Andorra    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..          NaN    ..    ..
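The first five rows are pure ASCII, so they look the same under any ASCII-compatible codec. A single-byte codec such as mac_roman or cp1252 also accepts the byte 0x92 without raising, so the absence of an error does not by itself show which encoding the file was written in. A minimal sketch for comparing candidate codecs on the one non-ASCII value in the file (the byte string quoted in the other answer); the names `sample` and `results` are illustrative:

```python
# The only non-ASCII value reported for this file, as raw bytes.
sample = b"Korea, Dem. People\x92s Rep."

# Decode the same bytes with each candidate codec and compare by eye.
results = {name: sample.decode(name) for name in ("cp1252", "mac_roman")}

for name, decoded in results.items():
    print(f"{name:>9}: {decoded}")
# cp1252 gives    "Korea, Dem. People’s Rep."
# mac_roman gives "Korea, Dem. Peopleís Rep."
```

Checking a row that actually contains non-ASCII characters, for example row 102 here, is a quick way to confirm the chosen codec produces the intended text.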

answered Oct 24 '22 10:10

navule
