Converting Unicode to ASCII in Python with a pandas DataFrame

I am trying to convert a DataFrame column of Unicode words to ASCII, storing the result in a new column, with certain character substitutions:

characterMap = {u'\u00E7': 'c', u'\u00C7' : 'C', u'\u011F' : 'g', u'\u011E' : 'G', u'\u00F6': 'o', u'\u00D6' : 'O', u'\u015F' : 's', u'\u015E' : 'S', u'\u00FC' : 'u', u'\u00DC' : 'U' , u'\u0131' : 'i', u'\u0130' : 'I', u'\u0259' : 'e', u'\u018F' : 'E'}

def convertASCII(word):
    # Replace each mapped character; leave every other character unchanged.
    asciiWord = ""
    word = str(word).rstrip()
    for c in word:
        if c in characterMap:
            asciiWord = asciiWord + characterMap[c]
        else:
            asciiWord = asciiWord + c
    return asciiWord

test['ascii'] = test['token'].apply(convertASCII)

The result should look something like this:

               token         ascii
1555757    qurbangaha    qurbangaha
379221          saylı         sayli
2456599      öhdəliyi      ohdeliyi
1128903            ki            ki
467997         ilişib        ilisib

However, the ascii column just repeats the token column instead of giving the desired result above. I have run convertASCII manually in another script and it does what I want, so I am not sure what goes wrong with pandas.

Bao Thai asked Apr 18 '18 05:04

1 Answer

If the Unicode conversion you need is a standard one (stripping accents and other diacritics), you can normalize the text and encode it directly to ASCII.

import unicodedata

test['ascii'] = test['token'].apply(lambda val: unicodedata.normalize('NFKD', val).encode('ascii', 'ignore').decode())
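To see what the NFKD step does before the ASCII encode, here is a minimal sketch (Python 3 assumed). The helper name decomposition_report is hypothetical and only for illustration. It lists the code points each non-ASCII character decomposes into: a letter such as 'ö' becomes a base letter plus a combining mark, so the base letter survives encode('ascii', 'ignore'), while a letter with no decomposition stays non-ASCII and is dropped.

```python
import unicodedata


def decomposition_report(text):
    """Return (char, NFKD code points) pairs for each non-ASCII character."""
    report = []
    for ch in text:
        if ord(ch) > 127:
            nfkd = unicodedata.normalize('NFKD', ch)
            report.append((ch, ['U+%04X' % ord(part) for part in nfkd]))
    return report


# U+00F6 (o with diaeresis) splits into 'o' (U+006F) + combining diaeresis (U+0308).
print(decomposition_report('\u00f6'))
# U+0131 (dotless i) and U+0259 (schwa) have no decomposition: they come back unchanged.
print(decomposition_report('\u0131\u0259'))
```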

Example:

import unicodedata
import pandas as pd

data = [{'name': 'saylı'}, {'name': 'öhdəliyi'}]
df = pd.DataFrame(data)
df['name'].apply(lambda val: unicodedata.normalize('NFKD', val).encode('ascii', 'ignore').decode())

output:

0       sayl
1    ohdliyi
Vikash Singh answered Sep 17 '22 01:09