I have a UTF-8 file with Twitter data and I am trying to read it into a pandas DataFrame, but I can only get an 'object' dtype instead of unicode strings:
# file 1459966468_324.csv
#1459966468_324.csv: UTF-8 Unicode English text
df = pd.read_csv('1459966468_324.csv', dtype={'text': unicode})
df.dtypes
text object
Airline object
name object
retweet_count float64
sentiment object
tweet_location object
dtype: object
What is the right way of reading and coercing UTF-8 data into unicode with Pandas?
This does not solve the problem:
df = pd.read_csv('1459966468_324.csv', encoding = 'utf8')
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Text file is here: https://raw.githubusercontent.com/l1x/nlp/master/1459966468_324.csv
The Series.str.encode() function encodes the character strings in a Series/Index using the indicated encoding.
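A small sketch of what that looks like in practice, assuming Python 3 and made-up sample strings (the names s and encoded are only for illustration). Encoding turns each str into a bytes object, and str.decode reverses it:

```python
import pandas as pd

# Hypothetical sample data with non-ASCII characters.
s = pd.Series(["café", "naïve"])

# Encode every string in the Series to UTF-8 bytes.
encoded = s.str.encode("utf-8")
first = encoded.iloc[0]

# Decode the bytes back into text.
decoded = encoded.str.decode("utf-8")
```

This is mostly useful when you need raw bytes for output; for the question above, the text read by read_csv is normally already decoded.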
We can read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame object. It is essentially the same as the read_csv() function, except that the default delimiter is a tab ('\t') instead of a comma.
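To illustrate the relationship between the two functions, here is a minimal sketch that uses an in-memory tab-separated string instead of a real file (the column names and values are invented for the example):

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk.
tsv = "text\tretweet_count\nhello\t1\nworld\t2\n"

# read_table splits on tabs by default.
df_tab = pd.read_table(io.StringIO(tsv))

# read_csv gives the same result once the separator is set to a tab.
df_csv = pd.read_csv(io.StringIO(tsv), sep="\t")
```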
As the other poster mentioned, you might try:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
However this could still leave you looking at 'object' when you print the dtypes. To confirm they are utf8, try this line after reading the CSV:
df.apply(lambda x: pd.lib.infer_dtype(x.values))
Example output:
args unicode
date datetime64
host unicode
kwargs unicode
operation unicode
Use the encoding keyword with the appropriate parameter:
df = pd.read_csv('1459966468_324.csv', encoding='utf8')
Pandas stores strings as objects. In Python 3, all strings are Unicode by default, so if you use Python 3, your data is already Unicode (don't be misled by the object dtype).
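A quick way to convince yourself of this, sketched under the assumption of Python 3 and with a made-up in-memory CSV in place of the real file: read UTF-8 bytes with read_csv and look at the type of an individual value rather than at the column dtype.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a UTF-8 encoded CSV file.
raw = "text,retweet_count\ncafé ☕,1.0\nnaïve,2.0\n".encode("utf-8")

df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# The text column is often reported as "object"; newer pandas
# releases may report a dedicated string dtype instead.
print(df.dtypes)

# The individual values are ordinary Python 3 str, i.e. Unicode text.
first_value = df["text"].iloc[0]
print(type(first_value))
```

So the column dtype only says how pandas stores the values; it says nothing about whether the text was decoded correctly.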
If you have Python 2, then use df = pd.read_csv('your_file', encoding='utf8'). Then try, for example, pd.lib.infer_dtype(df.iloc[0,0]) (I guess the first column consists of strings).
Looks like the location of this function has moved. This worked for me on pandas 1.0.1:
df.apply(lambda x: pd.api.types.infer_dtype(x.values))
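For reference, a self-contained sketch of the same check on a small made-up DataFrame (the column names and values are assumptions for illustration). Note that on Python 3 the inferred label for text is "string" rather than "unicode":

```python
import pandas as pd

# Hypothetical data: one text column, one float column, one integer column.
df = pd.DataFrame({
    "text": ["héllo", "wörld"],
    "retweet_count": [1.0, 2.0],
    "likes": [3, 4],
})

# Infer the kind of values held in each column.
kinds = {col: pd.api.types.infer_dtype(df[col]) for col in df.columns}
print(kinds)
```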