I have a pandas data frame, and I want the unique strings in a specific column. The strings are in Hebrew.
I wrote: all_names = history.name.unique() (history is the data frame, and name is the column).
I get strange duplicates that contain the \u200f character. For example, ערן appears once on its own and once prefixed with \u200f:
all_names
array(['\u200fערן', 'ערן', ...., None], dtype=object)
How can I remove these characters from the original data frame?
You can clean up your name strings by filtering out every character that is neither a word character nor whitespace (in the Unicode sense), applying a re.sub-based function to all the values in the name column.
For example (assuming Python 3, which handles Unicode properly):
>>> import re
>>> history.name.apply(lambda s: s and re.sub(r'[^\w\s]', '', s))
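In Python 3, \w matches all Unicode word characters (letters, digits, and the underscore) and \s matches all Unicode whitespace characters. The "s and ..." guard passes None values through unchanged. Note that apply returns a new Series, so assign the result back to the name column to change the original data frame.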
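As a minimal sketch of the filter above, here is the same substitution written as a named function and run on a plain list that mimics the values from the question. The names clean_name, names, cleaned and unique_names are hypothetical, chosen only for this illustration.

```python
import re

# Matches any character that is neither a word character nor whitespace.
_NOT_WORD_OR_SPACE = re.compile(r'[^\w\s]')

def clean_name(s):
    """Drop non-word, non-whitespace characters; pass None through unchanged."""
    if s is None:
        return None
    return _NOT_WORD_OR_SPACE.sub('', s)

# Hypothetical sample values mirroring the array shown in the question.
names = ['\u200fערן', 'ערן', None]
cleaned = [clean_name(n) for n in names]
unique_names = set(cleaned)
```

With a data frame like the one in the question, the same function could presumably be applied as history['name'] = history['name'].apply(clean_name), after which unique() should no longer report the \u200f variant separately.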
By the way, the \u200f (aka the RIGHT-TO-LEFT MARK) that's bothering you is in the Unicode codepoint category "Other, Format":
>>> import unicodedata
>>> unicodedata.name('\u200f')
'RIGHT-TO-LEFT MARK'
>>> unicodedata.category('\u200f')
'Cf'
Characters in the Cf category are neither word characters nor whitespace, so you can be sure it'll be removed by the filter above.
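The \w\s filter also strips punctuation such as hyphens and apostrophes, which may be part of a name. If that is too aggressive, one possible alternative (not part of the original answer) is to remove only the characters whose category is Cf. The function name strip_format_chars and the sample value are hypothetical.

```python
import unicodedata

def strip_format_chars(s):
    """Remove only 'Other, Format' (Cf) characters; pass None through unchanged."""
    if s is None:
        return None
    return ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')

# Hypothetical sample: a hyphenated name prefixed with RIGHT-TO-LEFT MARK.
sample = '\u200fבן-גוריון'
result = strip_format_chars(sample)  # the hyphen survives, \u200f does not
```

This keeps everything except invisible formatting marks, at the cost of leaving any other stray symbols in place.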