
How to read UTF-8 files with Pandas?

I have a UTF-8 file with Twitter data and I am trying to read it into a pandas DataFrame, but the text columns come back with dtype 'object' instead of unicode strings:

# $ file 1459966468_324.csv
# 1459966468_324.csv: UTF-8 Unicode English text
import pandas as pd
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text               object
Airline            object
name               object
retweet_count     float64
sentiment          object
tweet_location     object
dtype: object

What is the right way of reading and coercing UTF-8 data into unicode with Pandas?

This does not solve the problem:

df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))

Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv

Istvan asked Apr 06 '16 21:04



4 Answers

As the other poster mentioned, you might try:

df = pd.read_csv('1459966468_324.csv', encoding='utf8') 

However, this can still leave you looking at 'object' when you print the dtypes. To confirm the values are Unicode, try this line after reading the CSV:

df.apply(lambda x: pd.lib.infer_dtype(x.values)) 

Example output:

args            unicode
date         datetime64
host            unicode
kwargs          unicode
operation       unicode

Sam answered Sep 21 '22 06:09


Use the encoding keyword argument with the appropriate value:

df = pd.read_csv('1459966468_324.csv', encoding='utf8') 
Stefan answered Sep 23 '22 06:09


Pandas stores strings in columns of dtype object. In Python 3, all strings are Unicode by default, so if you use Python 3, your data is already Unicode (don't be misled by the object dtype).

If you are on Python 2, use df = pd.read_csv('your_file', encoding = 'utf8'). Then check the result with, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I am guessing the first column consists of strings).
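
To illustrate the Python 3 point, here is a minimal sketch. It assumes Python 3 and a reasonably recent pandas, and it uses a small made-up sample in place of the real tweet file (the column names and values are hypothetical). Whatever dtype pandas reports for the column, each individual value should be an ordinary Python str:

```python
import io

import pandas as pd

# Hypothetical sample standing in for the tweet CSV, encoded as UTF-8 bytes.
raw = "text,retweet_count\nflight délayed ✈,1.0\nthanks 😀,0.0\n".encode("utf-8")

# Decode the bytes as UTF-8 while parsing.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# The dtype shown for 'text' is typically object (newer pandas versions
# may report a dedicated string dtype instead).
print(df.dtypes)

# Regardless of the reported dtype, every element is a Python 3 str,
# which is a Unicode string.
all_str = all(isinstance(value, str) for value in df["text"])
print(all_str)
```

So the 'object' label describes how the column is stored, not how the text was decoded.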

ptrj answered Sep 23 '22 06:09


It looks like this function has moved. This worked for me on pandas 1.0.1:

df.apply(lambda x: pd.api.types.infer_dtype(x.values))
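
For reference, a small self-contained sketch of that call on a made-up frame (the column names and values are hypothetical, not taken from the linked file). It assumes pandas 1.0 or later on Python 3, where text columns are reported as 'string' rather than 'unicode':

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df = pd.DataFrame(
    {
        "text": ["späť", "日本語"],
        "retweet_count": [0.0, 2.0],
    }
)

# Infer the type of the values held in each column, rather than
# relying on the column's storage dtype.
inferred = df.apply(lambda col: pd.api.types.infer_dtype(col.values))
print(inferred)
```

The result is a Series indexed by column name, so a single column can be checked with inferred["text"].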
cefect answered Sep 21 '22 06:09