Correctly extract Emojis from a Unicode string

I am working in Python 2 and I have a string containing emojis as well as other Unicode characters. I need to convert it to a list where each entry is a single character/emoji.

    x = u'😘😘xyz😊😊'
    char_list = [c for c in x]

The desired output is:

['😘', '😘', 'x', 'y', 'z', '😊', '😊']

The actual output is:

[u'\ud83d', u'\ude18', u'\ud83d', u'\ude18', u'x', u'y', u'z', u'\ud83d', u'\ude0a', u'\ud83d', u'\ude0a']

How can I achieve the desired output?

asked Feb 15 '16 07:02

Aaron

2 Answers

First of all, in Python 2 you need to use Unicode strings (u'<...>') for Unicode characters to be treated as Unicode characters. You also need to declare the correct source encoding if you want to use the characters themselves rather than the \UXXXXXXXX representation in source code.

Now, as per "Python: getting correct string length when it contains surrogate pairs" and "Python returns length of 2 for single Unicode character string", in Python 2 "narrow" builds (with sys.maxunicode==65535), characters above U+FFFF are stored as surrogate pairs, and this is not transparent to string functions. This was only fixed in Python 3.3 (PEP 393).
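To see which kind of build you are running, you can compare sys.maxunicode against 0xFFFF. This is only a minimal sketch; the helper name is_narrow_build is made up for illustration and is not part of any library:

```python
import sys


def is_narrow_build():
    """Return True if this interpreter stores unicode strings as UTF-16 code units."""
    # Narrow builds report 0xFFFF (65535); wide builds and Python 3.3+ report 0x10FFFF.
    return sys.maxunicode == 0xFFFF


print(is_narrow_build())
```

If this prints True, iterating over a string yields the two halves of each surrogate pair separately, which is exactly the output shown in the question.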

The simplest resolution (save for migrating to 3.3+) is to compile a Python "wide" build from source as outlined on the 3rd link. In a wide build, every Unicode character takes 4 bytes (so strings can use considerably more memory), but if you routinely need to handle characters above U+FFFF, this is probably an acceptable price.

The solution for a "narrow" build is to make a custom set of string functions (len, slice; maybe as a subclass of unicode) that would detect surrogate pairs and handle them as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write:

As per the Wikipedia article on UTF-16 (section "U+10000 to U+10FFFF"):
- the 1st character (high surrogate) is in range 0xD800..0xDBFF
- the 2nd character (low surrogate) - in range 0xDC00..0xDFFF
- these ranges are reserved and thus cannot occur as regular characters
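To make the ranges above concrete, here is a small sketch of the standard UTF-16 arithmetic that turns a code point above U+FFFF into its two surrogates. The function name to_surrogate_pair is hypothetical and only serves the illustration:

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF: %#x" % code_point)
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


# U+1F618 is the first emoji in the question.
high, low = to_surrogate_pair(0x1F618)
print("%04x %04x" % (high, low))       # d83d de18
```

The result matches the u'\ud83d', u'\ude18' pair that appears in the question's actual output.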

So here's the code to detect a surrogate pair:

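    def is_surrogate(s, i):
        # True if s[i] is a high surrogate that starts a valid pair with s[i+1].
        if 0xD800 <= ord(s[i]) <= 0xDBFF:
            try:
                l = s[i+1]
            except IndexError:
                # A high surrogate at the very end of the string has no partner.
                return False
            if 0xDC00 <= ord(l) <= 0xDFFF:
                return True
            else:
                raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
        else:
            return False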

And a function that returns a simple slice:

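    # Note: this name shadows the built-in slice type in the module that defines it.
    def slice(s, start, end):
        # start and end count characters, not UTF-16 code units.
        l = len(s)
        i = 0
        # Walk up to the start position; every pair passed shifts both bounds by one.
        while i < start and i < l:
            if is_surrogate(s, i):
                start += 1
                end += 1
                i += 1
            i += 1
        # Walk through the slice itself; only the end bound still needs adjusting.
        while i < end and i < l:
            if is_surrogate(s, i):
                end += 1
                i += 1
            i += 1
        return s[start:end]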

Here, the price you pay is performance, as these functions are much slower than built-ins:

    >>> import timeit
    >>> ux = u"a"*5000 + u"\U00100000"*30000 + u"b"*50000
    >>> timeit.timeit('slice(ux,10000,100000)', 'from __main__ import slice,ux', number=1000)
    46.44128203392029    # seconds for 1000 calls, i.e. about 46 msec per call
    >>> timeit.timeit('ux[10000:100000]', 'from __main__ import slice,ux', number=1000000)
    8.814016103744507    # seconds for 1000000 calls, i.e. about 8.8 usec per call
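For the list the question asks for, the same surrogate ranges can be applied directly. The following is only a sketch of one way to do it, not a library function: split_chars is a made-up name, and unlike is_surrogate above it passes unpaired surrogates through as separate entries instead of raising.

```python
def split_chars(s):
    """Split a unicode string into characters, keeping surrogate pairs whole."""
    chars = []
    i = 0
    n = len(s)
    while i < n:
        # A high surrogate immediately followed by a low surrogate is one character.
        if (0xD800 <= ord(s[i]) <= 0xDBFF and i + 1 < n
                and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
            chars.append(s[i:i + 2])
            i += 2
        else:
            chars.append(s[i])
            i += 1
    return chars


# The emojis from the question, written as escapes so that the example
# does not depend on the source file encoding.
x = u'\ud83d\ude18\ud83d\ude18xyz\ud83d\ude0a\ud83d\ude0a'
char_list = split_chars(x)
print(len(char_list))  # 7
```

Each emoji now occupies a single list entry (a two-unit string on a narrow build), so the list has the 7 entries of the desired output rather than the 11 of the actual output.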

answered Sep 21 '22 10:09

ivan_pozdeev

I would use the uniseg library (pip install uniseg):

    # -*- coding: utf-8 -*-
    from uniseg import graphemecluster as gc

    print list(gc.grapheme_clusters(u'😘😘xyz😊😊'))

outputs [u'\U0001f618', u'\U0001f618', u'x', u'y', u'z', u'\U0001f60a', u'\U0001f60a'], and

    [x.encode('utf-8') for x in gc.grapheme_clusters(u'😘😘xyz😊😊')]

will provide the list of characters as UTF-8 encoded strings.

answered Sep 23 '22 10:09

James Hopkin
