Remove Right-to-left character \u200f in Python (Hebrew)

I have a dataframe, and I want the unique strings of a specific column. The strings are in Hebrew.

Since I'm using a pandas DataFrame, I wrote: all_names = history.name.unique() (history is the DataFrame, which has a name column).

I get strange duplicates that contain the \u200f character. For example, ערן appears once on its own and once prefixed with \u200f:

all_names
array(['\u200fערן', 'ערן',  ...., None], dtype=object)

How can I remove these characters from the original data frame?

sheldonzy asked Oct 24 '25 18:10



1 Answer

You can clean up the name strings by removing every character that is neither a word character nor whitespace (in the Unicode sense), applying a re.sub-based function to each value in the name column.

For example (assuming Python 3, which handles Unicode properly):

>>> import re
>>> history['name'] = history.name.apply(lambda s: s and re.sub(r'[^\w\s]', '', s))

Here \w matches any Unicode word character (letters, digits, and the underscore) and \s matches any Unicode whitespace character. The `s and ...` guard leaves None values untouched.
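As a minimal sketch of what this filter does, the snippet below applies the same pattern to a few plain strings. The helper name `clean_name` and the sample values are made up for illustration; they are not part of the question's data.

```python
import re

# Matches anything that is neither a word character (\w) nor whitespace (\s).
NON_WORD = re.compile(r'[^\w\s]')

def clean_name(s):
    """Return s without non-word, non-whitespace characters; pass None through."""
    return s and NON_WORD.sub('', s)

# Hypothetical sample values, mirroring the array shown in the question.
names = ['\u200fערן', 'ערן', None]
cleaned = [clean_name(n) for n in names]

# After cleaning, the two spellings of the name compare equal.
print(cleaned[0] == cleaned[1])
```

With pandas, the same function could be passed to history.name.apply(clean_name). Keep in mind that this pattern also strips punctuation such as hyphens and apostrophes, which may matter for some names.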

By the way, the \u200f character that's bothering you (the RIGHT-TO-LEFT MARK) belongs to the Unicode general category "Other, Format" (Cf):

>>> import unicodedata
>>> unicodedata.name('\u200f')
'RIGHT-TO-LEFT MARK'
>>> unicodedata.category('\u200f')
'Cf'

So you can be sure the filter above removes it: a format character matches neither \w nor \s.
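If stripping all punctuation is too aggressive for your data, the category check can also be used directly as the filter. This is a different technique from the regex, shown only as a sketch; the helper name `strip_format_chars` is hypothetical. Note that category Cf covers other invisible characters too (for example zero-width joiners), so this assumes none of them are meaningful in your names.

```python
import unicodedata

def strip_format_chars(s):
    """Drop every character whose Unicode general category is Cf; pass None through."""
    if s is None:
        return None
    return ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')

# The RIGHT-TO-LEFT MARK is removed, so both spellings compare equal.
print(strip_format_chars('\u200fערן') == 'ערן')

# Ordinary punctuation is left alone, unlike with the [^\w\s] pattern.
print(strip_format_chars("O'Neil-Smith"))
```

As with the regex version, this function could be applied to the column with history.name.apply(strip_format_chars).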

randomir answered Oct 26 '25 08:10



