
How to read a DataFrame of encoded strings from CSV in Python

Suppose I read an HTML page and get a list of names such as 'Amiel, Henri-Frédéric'.

To get the list of names, I decode the HTML with the following code:

import urllib

# t is an HTML parser instance created earlier (its definition is not shown)
f = urllib.urlopen("http://xxx.htm")
html = f.read()
html = html.decode('utf8')
t.feed(html)
t.close()
lista = t.data

At this point, the variable lista contains a list of names like:

[u'Abatantuono, Diego', ... , u'Amiel, Henri-Frédéric']

Now I would like to:

  1. put these names in a DataFrame;
  2. save the DataFrame to a CSV file;
  3. read the CSV file back into a DataFrame in Python.

For simplicity, let's consider just the name above for steps 1 to 3. I would use the following code:

import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'
name = name.encode('utf8')
array = [name]
df = pd.DataFrame({'Names': array})
df.to_csv('names')
uni = pd.read_csv('names')
uni  # trying to read the csv file into a DataFrame

At this point I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 67: invalid continuation byte      

If I replace the last line of the code above with:

print uni

I can read the DataFrame, but I don't think this is the right way to handle the issue.

I have read many questions from other users on this topic, but none of them helped me solve this one.

fabrizio_ff asked Mar 25 '13 07:03




1 Answer

Both the to_csv method and the read_csv function take an encoding argument. Use it, and work with unicode internally. If you don't, encoding and decoding by hand inside your program will trip you up.

import pandas as pd

name = u'Amiel, Henri-Fr\xe9d\xe9ric'
array = [name]
df = pd.DataFrame({'Names':array})
df.to_csv('names', encoding='utf-8')
uni = pd.read_csv('names', index_col = [0], encoding='utf-8')
print uni  # for me it works with or without print

out:

                   Names
0  Amiel, Henri-Frédéric
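
For readers on Python 3, the same round trip can be sketched as below. This is only an illustration under a few assumptions: it targets Python 3 (where every str is already unicode), the temporary path and the helper variable names are made up for the example, and the last part does not claim to reproduce the exact traceback from the question. It only shows why a lone 0xe9 byte is rejected by the UTF-8 codec.

```python
import os
import tempfile

import pandas as pd

# Keep the name as text (str) inside the program; no manual .encode() calls.
name = "Amiel, Henri-Fr\xe9d\xe9ric"
df = pd.DataFrame({"Names": [name]})

# Hypothetical temporary location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "names.csv")

# Encode once when writing and decode once when reading.
df.to_csv(path, encoding="utf-8")
uni = pd.read_csv(path, index_col=0, encoding="utf-8")

# The value read back should equal the original text.
round_trip_ok = uni.loc[0, "Names"] == name

# In UTF-8, "é" is the two bytes 0xc3 0xa9; the single byte 0xe9 is its
# Latin-1 encoding, which is not valid UTF-8 when followed by "d".
utf8_bytes = name.encode("utf-8")
latin1_bytes = name.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    latin1_decodes_as_utf8 = True
except UnicodeDecodeError:
    latin1_decodes_as_utf8 = False
```

Letting to_csv and read_csv handle the encoding means the bytes on disk and the codec used to read them always match, which is the point of passing encoding on both sides.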
root answered Oct 08 '22 22:10