Python 2.7: I would like to do something unusual. Most people want to convert string literals to more human-readable strings; I would like to convert the following list of unicode strings into their escaped literal forms, i.e. their codepoint forms, which look something like this:
\u3023\u2344
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert that word in particular into its literal form, so that I can use a regex to remove the extra breve.
I am not sure what the terminology is for these things—please correct me if I am mistaken.
You can use the str.encode([encoding[, errors]]) function with the unicode_escape encoding:
>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
b'fre\\xf8\\u032f\\u032fnt'
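As an aside, the b'...' above is the Python 3 output; on Python 2.7, encode() returns a plain str, so print shows fre\xf8\u032f\u032fnt without the prefix. Either way the escaped form round-trips through the same codec (a quick check, reusing the same s):
>>> s.encode('unicode_escape').decode('unicode_escape') == s
True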
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects.
Unicode string is a sequence of Unicode codepoints in Python. The same user-perceived characters can be written using different codepoints e.g., 'Ç' could be written as u'\u00c7' and u'\u0043\u0327'.
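For example, you can check that equivalence in the interpreter (a quick illustration, not part of the answer's code):
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0043\u0327') == u'\u00c7'
True
>>> unicodedata.normalize('NFD', u'\u00c7') == u'\u0043\u0327'
True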
You could use the NFKD Unicode normalization form to make sure the "breves" are separate codepoints, so that you don't miss them when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
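For reference, a quick check of what that does to the marks (my own snippet, reusing s and the imports from the script above; ø itself has no decomposition, so NFKD only affects how the combining marks are exposed):
>>> unicodedata.normalize('NFKD', s).count(u'\u032f')
2
>>> re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)).count(u'\u032f')
1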
Could you explain why your re.sub command does not use a \1+ backreference to ensure that the breves are consecutive characters (like @Paulo Freitas's answer)?
re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc in the text. Sometimes the regex does unnecessary work by replacing 'c' with 'c'. But the result is the same: no consecutive duplicate 'c' in the text.
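A toy illustration of that (my example, not from the answer):
>>> import re
>>> re.sub('c+', 'c', 'accccb ac')
'acb ac'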
The regex from @Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. You can measure time performance and see what regex runs faster if it is a bottleneck in your application.
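If you do want to compare them, a minimal timing sketch with timeit might look like this (the repeated test string and the iteration count are arbitrary choices for illustration; on Python 2.7 the file also needs the # -*- coding: utf-8 -*- line as in the script above):
import re
import timeit
import unicodedata

# build a measurable input by repeating the normalized sample string
s = unicodedata.normalize('NFKD', u"freø̯̯nt") * 1000
# time the simple '+' pattern against the backreference pattern
print(timeit.timeit(lambda: re.sub(u'\u032f+', u'\u032f', s), number=1000))
print(timeit.timeit(lambda: re.sub(u'(\u032f)\\1+', r'\1', s), number=1000))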