I can read a CSV file in which one column contains Chinese characters (the other columns contain English text and numbers). However, the Chinese characters don't display correctly. See the photo below.
I loaded the CSV file with pd.read_csv(). Neither display(data06_16) nor data06_16.head() displays the Chinese characters correctly.
I tried adding the following lines to my .bash_profile:
export LC_ALL=zh_CN.UTF-8
export LANG=zh_CN.UTF-8
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
but it doesn't help.
I have also tried passing the encoding argument to pd.read_csv():
pd.read_csv('data.csv', encoding='utf_8')
pd.read_csv('data.csv', encoding='utf_16')
pd.read_csv('data.csv', encoding='utf_32')
None of these worked.
How can I display the Chinese characters properly?
I just remembered that the source dataset was created using encoding='GBK', so I tried again using:
data06_16 = pd.read_csv("../data/stocks1542monthly.csv", encoding="GBK")
Now, I can see all the Chinese characters.
Thanks guys!
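If you don't remember which encoding a file was written with, you can try a few likely candidates in order. The following is only a minimal sketch: detect_encoding is a hypothetical helper (not part of pandas), and the candidate list is an assumption suited to Chinese data, with gb18030 chosen because it is a superset of GBK.

```python
from pathlib import Path


def detect_encoding(path, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper. Order matters: UTF-8 is strict and rejects most
    GBK-encoded Chinese text, so it is tried first; gb18030 is far more
    permissive and would otherwise mask a genuine UTF-8 file.
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next
        return enc
    raise ValueError(f"none of {candidates} could decode {path}")
```

The returned name can then be handed to pandas, for example pd.read_csv(path, encoding=detect_encoding(path)). A successful decode does not prove the guess is right, so it is still worth checking the displayed text by eye.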
I see three possible approaches here:
1) You can try this:
import codecs
x = codecs.open("testdata.csv", "r", "utf-8")
2) Another possibility, in theory, is this (read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed):
import pandas as pd
df = pd.read_csv('testdata.csv', encoding='utf-8')
3) Maybe you should convert your CSV file to UTF-8 before importing it with Python (for example, in Notepad++)? That works for a one-time import, but of course not for an automated process.
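For an automated process, the conversion in option 3 can also be done in Python rather than by hand. This is a rough sketch under assumptions: convert_to_utf8 is a hypothetical helper, the file names are placeholders, and the source encoding defaults to GBK only because that is what the file in this thread turned out to be.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="gbk"):
    """Re-encode the file at src as UTF-8 and write the result to dst.

    Hypothetical helper. It works on raw bytes, so line endings are
    preserved exactly. A UnicodeDecodeError is raised if source_encoding
    does not match the actual contents of src.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

After something like convert_to_utf8("testdata.csv", "testdata_utf8.csv"), the new file can be read with encoding='utf-8' as in option 2.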