Python 2.7: I would like to do something unusual. Most people want to convert string literals to more human-readable strings. I would like to convert a list of unicode strings (one of them is freø̯̯nt) into their literal, codepoint-escaped forms, which look something like this:
\u3023\u2344
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert that word in particular into its literal form, so that I can use a regex to remove the extra breve.
I am not sure what the terminology is for these things—please correct me if I am mistaken.
You can use the str.encode([encoding[, errors]]) function with the unicode_escape encoding:
>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
b'fre\\xf8\\u032f\\u032fnt'
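(The b'...' output above is what Python 3 prints; on Python 2.7, where encode() returns a plain str, the same call prints fre\xf8\u032f\u032fnt.) As a minimal sketch of my own, not part of the original answer, the escaped form also round-trips back to the original string:
# -*- coding: utf-8 -*-
s = u'freø̯̯nt'
escaped = s.encode('unicode_escape')           # str on Python 2.7, bytes on Python 3
print(escaped)                                 # fre\xf8\u032f\u032fnt (wrapped in b'...' on Python 3)
print(escaped.decode('unicode_escape') == s)   # True: the escapes decode back to the same codepoints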
Quoting the question: "I would like to convert that word in particular into its literal form, so that I can use a regex to remove the extra breve."
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects.
A Unicode string is a sequence of Unicode codepoints in Python. The same user-perceived character can be written using different codepoints, e.g., 'Ç' could be written as u'\u00c7' or as u'\u0043\u0327'.
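For example (a small check of my own, not from the original answer), the two spellings only compare equal after normalization:
# -*- coding: utf-8 -*-
import unicodedata

precomposed = u'\u00c7'        # 'Ç' as a single codepoint
decomposed = u'\u0043\u0327'   # 'C' followed by a combining cedilla

print(precomposed == decomposed)                                 # False: different codepoints
print(unicodedata.normalize('NFD', precomposed) == decomposed)   # True once decomposed
print(unicodedata.normalize('NFC', decomposed) == precomposed)   # True once composed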
You could use the NFKD Unicode normalization form to make sure the "breves" are separate codepoints, so that you don't miss them when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
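To confirm that only one breve is left, you could count the combining inverted breve (U+032F) before and after; this is a quick check I am adding, continuing from the snippet above, not part of the original answer:
normalized = unicodedata.normalize('NFKD', s)
cleaned = re.sub(u'\u032f+', u'\u032f', normalized)
print(normalized.count(u'\u032f'))   # 2: the duplicated breves
print(cleaned.count(u'\u032f'))      # 1: only one breve remains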
Could you explain why your re.sub command does not have a \1 backreference to ensure that the breves are consecutive characters (like @Paulo Freitas's answer)?
re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc. in the text. Sometimes the regex does unnecessary work by replacing 'c' with 'c', but the result is the same: no consecutive duplicate 'c' in the text.
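A toy example of my own to illustrate the point:
import re

text = 'acccbccdc'
print(re.sub('c+', 'c', text))          # acbcdc: every run of "c" matches, even the lone final one (replaced with itself)
print(re.sub('(c)\\1+', r'\1', text))   # acbcdc: same result, but runs of a single "c" are never matched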
The regex from @Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. You can measure the time performance of both and see which regex runs faster if this is a bottleneck in your application.
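If it ever becomes a concern, a rough comparison could look something like this (a sketch using timeit; the numbers will vary by machine):
# -*- coding: utf-8 -*-
import re
import timeit
import unicodedata

s = unicodedata.normalize('NFKD', u"freø̯̯nt")

print(timeit.timeit(lambda: re.sub(u'\u032f+', u'\u032f', s), number=100000))
print(timeit.timeit(lambda: re.sub(u'(\u032f)\\1+', r'\1', s), number=100000))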