Python 2.7: I would like to do something unusual. Most people want to convert string literals to more human-readable strings. I would like to convert a list of unicode strings (one of them is freø̯̯nt) into their literal, codepoint-escaped forms, which look something like this:
\u3023\u2344
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert that word in particular into its literal form, so that I can use a regex to remove the extra breve.
I am not sure what the terminology is for these things—please correct me if I am mistaken.
You can use the str.encode([encoding[, errors]]) function with the unicode_escape encoding:
>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
b'fre\\xf8\\u032f\\u032fnt'
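(The b'...' output above is what Python 3 prints; on Python 2.7, where encode() returns a plain str, the same call prints fre\xf8\u032f\u032fnt.) As a minimal sketch of my own, not part of the original answer, the escaped form also round-trips back to the original string:
# -*- coding: utf-8 -*-
s = u'freø̯̯nt'
escaped = s.encode('unicode_escape')           # str on Python 2.7, bytes on Python 3
print(escaped)                                 # fre\xf8\u032f\u032fnt (wrapped in b'...' on Python 3)
print(escaped.decode('unicode_escape') == s)   # True: the escapes decode back to the same codepoints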
Quoting the question: "I would like to convert that word in particular into its literal form, so that I can use a regex to remove the extra breve."
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects.
A Unicode string is a sequence of Unicode codepoints in Python. The same user-perceived character can be written using different codepoints, e.g., 'Ç' could be written as u'\u00c7' or as u'\u0043\u0327'.
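For example (a small check of my own, not from the original answer), the two spellings only compare equal after normalization:
# -*- coding: utf-8 -*-
import unicodedata

precomposed = u'\u00c7'        # 'Ç' as a single codepoint
decomposed = u'\u0043\u0327'   # 'C' followed by a combining cedilla

print(precomposed == decomposed)                                 # False: different codepoints
print(unicodedata.normalize('NFD', precomposed) == decomposed)   # True once decomposed
print(unicodedata.normalize('NFC', decomposed) == precomposed)   # True once composed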
You could use the NFKD Unicode normalization form to make sure the "breves" are separate codepoints, so that you don't miss them when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
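To confirm that only one breve is left, you could count the combining inverted breve (U+032F) before and after; this is a quick check I am adding, continuing from the snippet above, not part of the original answer:
normalized = unicodedata.normalize('NFKD', s)
cleaned = re.sub(u'\u032f+', u'\u032f', normalized)
print(normalized.count(u'\u032f'))   # 2: the duplicated breves
print(cleaned.count(u'\u032f'))      # 1: only one breve remains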
Could you explain why your re.sub command does not have a \1 backreference to ensure that the breves are consecutive characters (like @Paulo Freitas's answer)?
re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc. in the text. Sometimes the regex does unnecessary work by replacing 'c' with 'c', but the result is the same: no consecutive duplicate 'c' in the text.
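A toy example of my own to illustrate the point:
import re

text = 'acccbccdc'
print(re.sub('c+', 'c', text))          # acbcdc: every run of "c" matches, even the lone final one (replaced with itself)
print(re.sub('(c)\\1+', r'\1', text))   # acbcdc: same result, but runs of a single "c" are never matched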
The regex from @Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. You can measure the time performance of both and see which regex runs faster if this is a bottleneck in your application.
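If it ever becomes a concern, a rough comparison could look something like this (a sketch using timeit; the numbers will vary by machine):
# -*- coding: utf-8 -*-
import re
import timeit
import unicodedata

s = unicodedata.normalize('NFKD', u"freø̯̯nt")

print(timeit.timeit(lambda: re.sub(u'\u032f+', u'\u032f', s), number=100000))
print(timeit.timeit(lambda: re.sub(u'(\u032f)\\1+', r'\1', s), number=100000))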