
Special text to Latin characters in Python

Tags: python, pandas

I have the following pandas data frame:

import pandas as pd

the_df = pd.DataFrame({'id':[1,2],'name':['Joe','𝒮𝒶𝓇𝒶𝒽']})
the_df
    id  name
0   1   Joe
1   2   𝒮𝒶𝓇𝒶𝒽

As you can see, we can read the second name as "Sarah", but it's written with special characters.

I want to create a new column with these characters converted to Latin characters. I have tried this approach:

the_df['latin_name'] = the_df['name'].str.extract(r'(^[a-zA-Z\s]*)')
the_df
    id  name    latin_name
0   1   Joe     Joe
1   2   𝒮𝒶𝓇𝒶𝒽  

But it doesn't recognize the letters. Please, any help on this will be greatly appreciated.

Alexis asked Aug 05 '21 17:08



People also ask

What are Latin special characters?

The Latin-1 characters with numerical codes above 127 are mostly accented letters used in various European languages: c cedilla (ç), e grave (è), n tilde (ñ), u umlaut (ü), and so on. These are needed for writing in French, German, Spanish, etc.

How do I remove Unicode characters from a string in Python?

In Python, you can remove non-ASCII characters from a string by encoding it with str.encode() using the 'ascii' codec and errors='ignore', then decoding the result back to a string.
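A minimal sketch of that encode/decode approach, assuming a hypothetical helper named to_ascii (it is not part of any library). It normalizes to NFKD first, so accented letters keep their base letter and only the combining marks are dropped:

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: NFKD-normalize, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently discards characters the ASCII codec cannot encode.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Fran\u00e7ois"))  # c with cedilla -> Francois
print(to_ascii("ma\u00f1ana"))    # n with tilde   -> manana
```

Without the normalization step, encoding with errors='ignore' would remove such characters entirely rather than map them to a plain letter.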


2 Answers

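Try .str.normalize with the 'NFKC' form, which replaces compatibility characters such as these script letters with their plain equivalents before the regex runs: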

the_df['name'].str.normalize('NFKC').str.extract(r'(^[a-zA-Z\s]*)')

Output:

       0
0    Joe
1  Sarah
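For reference, a self-contained sketch of the same idea that also writes the result into the latin_name column from the question. The variable fancy_sarah and the use of expand=False (to get a Series instead of a one-column DataFrame) are choices made for this example, not part of the snippet above:

```python
import pandas as pd

# The script-style name, written with escapes so it survives re-encoding.
fancy_sarah = "\U0001d4ae\U0001d4b6\U0001d4c7\U0001d4b6\U0001d4bd"
the_df = pd.DataFrame({"id": [1, 2], "name": ["Joe", fancy_sarah]})

# Normalize first, then keep the leading run of ASCII letters and whitespace.
normalized = the_df["name"].str.normalize("NFKC")
the_df["latin_name"] = normalized.str.extract(r"(^[a-zA-Z\s]*)", expand=False)

print(the_df["latin_name"].tolist())  # ['Joe', 'Sarah']
```

If the names only need normalizing and not filtering, the extract step can probably be dropped and the normalized Series assigned directly.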
Scott Boston answered Oct 09 '22 00:10



You can use unicodedata.normalize:

>>> import unicodedata
>>> the_df['name'].apply(lambda x: unicodedata.normalize('NFKD', x))
0      Joe
1    Sarah
Name: name, dtype: object
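To see why normalization works here, it can help to inspect the characters themselves. The sketch below (the variable names are illustrative) prints each character's code point, official Unicode name, and compatibility decomposition, which is the mapping that the NFKD and NFKC forms apply:

```python
import unicodedata

# The script-style name, written with escapes so it survives re-encoding.
fancy_sarah = "\U0001d4ae\U0001d4b6\U0001d4c7\U0001d4b6\U0001d4bd"

for ch in fancy_sarah:
    # decomposition() returns the compatibility mapping as a string of hex code points.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", unicodedata.decomposition(ch))

latin = unicodedata.normalize("NFKD", fancy_sarah)
print(latin)  # Sarah
```

For these letters NFKD and NFKC give the same result; the two forms differ only in whether accented letters are left decomposed or recomposed afterwards.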
ThePyGuy answered Oct 09 '22 00:10
