I have the following pandas data frame:
the_df = pd.DataFrame({'id':[1,2],'name':['Joe','𝒮𝒶𝓇𝒶𝒽']})
the_df
   id   name
0   1    Joe
1   2  𝒮𝒶𝓇𝒶𝒽
As you can see, we can read the second name as "Sarah", but it's written with special characters.
I want to create a new column with these characters converted to plain Latin letters. I have tried this approach:
the_df['latin_name'] = the_df['name'].str.extract(r'(^[a-zA-Z\s]*)')
the_df
   id   name latin_name
0   1    Joe        Joe
1   2  𝒮𝒶𝓇𝒶𝒽
But it doesn't recognize the letters. Please, any help on this will be greatly appreciated.
The Latin-1 characters with numerical codes above 127 are mostly accented letters used in various European languages: c cedilla (ç), e grave (è), n tilde (ñ), u umlaut (ü), and such. These are needed for writing in French, German, Spanish, etc.
In Python, one way to remove non-ASCII characters from a string is to encode it with str.encode('ascii', 'ignore') and decode the result. Note that this drops those characters rather than converting them.
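To illustrate why encoding alone does not answer the question, here is a minimal sketch using only the standard library. The helper name strip_non_ascii is made up for this example, and the styled name is written with escape sequences for the Mathematical Script letters that appear in the question.

```python
# Hypothetical helper: drops every character outside ASCII.
def strip_non_ascii(text):
    return text.encode('ascii', 'ignore').decode('ascii')

# "Sarah" written with Mathematical Script letters (U+1D4AE, U+1D4B6, ...).
styled = '\U0001D4AE\U0001D4B6\U0001D4C7\U0001D4B6\U0001D4BD'

plain_result = strip_non_ascii('Joe')      # ASCII text passes through unchanged
accent_result = strip_non_ascii('caf\u00e9')  # the accented letter is dropped
styled_result = strip_non_ascii(styled)    # every styled letter is dropped
```

Because the styled letters are simply discarded, the name comes back empty, which is why the answers below normalize the string first.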
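Try .str.normalize before extracting. The 'NFKC' form replaces compatibility characters, such as these styled letters, with their plain equivalents: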
the_df['name'].str.normalize('NFKC').str.extract(r'(^[a-zA-Z\s]*)')
Output:
0
0 Joe
1 Sarah
You can use unicodedata.normalize:
>>> import unicodedata
>>> the_df['name'].apply(lambda x: unicodedata.normalize('NFKD', x))
0 Joe
1 Sarah
Name: name, dtype: object
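The two ideas in this thread can be combined. The sketch below is one possible approach, not the only one: it normalizes with 'NFKD' (which also splits accented letters into a base letter plus a combining mark) and then drops whatever is still non-ASCII. The function name to_latin is an assumption made for illustration.

```python
import unicodedata

# Hypothetical helper: normalize first, then discard leftover non-ASCII
# characters such as the combining marks produced by NFKD.
def to_latin(text):
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# "Sarah" written with Mathematical Script letters, as in the question.
styled = '\U0001D4AE\U0001D4B6\U0001D4C7\U0001D4B6\U0001D4BD'

converted_styled = to_latin(styled)
converted_accent = to_latin('Fran\u00e7ois')  # c cedilla becomes plain c
converted_plain = to_latin('Joe')
```

On the data frame from the question this could be applied as the_df['latin_name'] = the_df['name'].apply(to_latin). Keep in mind that characters with no ASCII decomposition (for example Cyrillic or CJK text) would be removed rather than transliterated.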