Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I convert a unicode string into string literals in Python 2.7?

Python2.7: I would like to do something unusual. Most people want to convert string literals to more human-readable strings. I would like to convert the following list of unicode strings into their literal forms:

  • hallöchen
  • Straße
  • Gemüse
  • freø̯̯nt

to their codepoint forms that look something like this:

\u3023\u2344

You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.

I am not sure what the terminology is for these things—please correct me if I am mistaken.

like image 768
Jonathan Komar Avatar asked Dec 25 '13 16:12

Jonathan Komar


People also ask

How do you convert a Unicode character to a string in Python?

To convert Python Unicode to string, use the unicodedata. normalize() function. The Unicode standard defines various normalization forms of a Unicode string, based on canonical equivalence and compatibility equivalence.

How do you convert a string to a literal in Python?

In Python, when you prefix a string with the letter r or R such as r'...' and R'...' , that string becomes a raw string. Unlike a regular string, a raw string treats the backslashes ( \ ) as literal characters.

How do I remove Unicode from a string in Python?

In python, to remove Unicode ” u “ character from string then, we can use the replace() method to remove the Unicode ” u ” from the string. After writing the above code (python remove Unicode ” u ” from a string), Ones you will print “ string_unicode ” then the output will appear as a “ Python is easy. ”.


Video Answer


2 Answers

You can use the str.encode([encoding[, errors]]) function with the unicode_escape encoding:

>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
b'fre\\xf8\\u032f\\u032fnt'
like image 129
Anym Avatar answered Sep 26 '22 13:09

Anym


You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.

You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory only string objects.

Unicode string is a sequence of Unicode codepoints in Python. The same user-perceived characters can be written using different codepoints e.g., 'Ç' could be written as u'\u00c7' and u'\u0043\u0327'.

You could use NFKD Unicode normalization form to make sure "breves" are separate in order not to miss them when they are duplicated:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata

s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))

Could you explain why your re.sub command does not have any +1 for ensuring that the breves are consecutive characters? (like @Paulo Freitas's answer)

re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc in the text. Sometimes the regex does unnecessary work by replacing 'c' with 'c'. But the result is the same: no consecutive duplicate 'c' in the text.

The regex from @Paulo Freitas's answer should also work:

no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))

It performs the replacement only for duplicates. You can measure time performance and see what regex runs faster if it is a bottleneck in your application.

like image 39
jfs Avatar answered Sep 26 '22 13:09

jfs