
How can I properly parse a string into grapheme clusters using Python 2.7?

I'm trying to find a way to split a string into grapheme clusters using Python 2.7.1. For example, the string:

details = u"Hello 🇦🇹🇻🇪"

I believe should be parsed as:

[u"H", u"e", u"l", u"l", u"o", u" ", u"\U0001f1e6\U0001f1f9", u"\U0001f1fb\U0001f1ea"]

I was using grapheme_clusters from the uniseg library, but this produces:

[u"H", u"e", u"l", u"l", u"o", u" ", u"\U0001f1e6\U0001f1f9\U0001f1fb\U0001f1ea"]

I have a hard requirement to use Python 2.7.1, although I know Unicode support is better in Python 3.x.

  1. Is my interpretation of how this string should be parsed correct?
  2. Is there a way to do this with Python 2.7?
  3. Is this actually any easier in Python 3.x?
Eric Conner asked Oct 30 '25 11:10



1 Answer

The output you get from uniseg used to be correct, but the Unicode rules have since changed, and your interpretation matches the current rules.

As of uniseg version 0.7.1 (current as of this post), the uniseg documentation refers to an outdated version of the Unicode grapheme cluster boundary rules, the one given in Unicode Standard Annex #29, Revision 21. This revision includes the rule

Do not break between regional indicator symbols.

whereas the most recent version of the Unicode grapheme cluster boundary rules, given in Unicode Standard Annex #29, Revision 29, says

Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.

You could file a bug report with uniseg, or look for a different library with a more up-to-date implementation. The uniseg Bitbucket page links to a few alternatives, such as PyICU.
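In the meantime, one possible workaround is to post-process the clusters you already get and re-split any cluster made up purely of regional indicators into pairs, which is what the newer rule amounts to for a plain run of flags. This is only a minimal sketch, not anything uniseg provides: split_flag_runs and is_regional_indicator are hypothetical helper names, and it assumes each regional indicator is a single code point (Python 3, or a wide build of Python 2). On a narrow Python 2 build the characters show up as surrogate pairs and the clusters pass through unchanged.

```python
# -*- coding: utf-8 -*-
# Sketch of a post-processing step; helper names are hypothetical.
# Assumption: every regional indicator (RI) is one code point, which holds on
# Python 3 and on wide builds of Python 2, but not on narrow Python 2 builds.

RI_FIRST = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
RI_LAST = 0x1F1FF   # REGIONAL INDICATOR SYMBOL LETTER Z


def is_regional_indicator(ch):
    """Return True if the single character ch is a regional indicator."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_flag_runs(clusters):
    """Re-split clusters consisting only of RIs into pairs.

    A cluster of an odd number of RIs ends with one unpaired RI, which is
    also where the newer rule would place the break. Clusters containing
    anything other than RIs are left untouched.
    """
    result = []
    for cluster in clusters:
        if len(cluster) > 2 and all(is_regional_indicator(c) for c in cluster):
            for i in range(0, len(cluster), 2):
                result.append(cluster[i:i + 2])
        else:
            result.append(cluster)
    return result


# The list below stands in for output produced under the older rule.
old_style = [u"H", u"e", u"l", u"l", u"o", u" ",
             u"\U0001f1e6\U0001f1f9\U0001f1fb\U0001f1ea"]
new_style = split_flag_runs(old_style)
```

With uniseg you would pass list(grapheme_clusters(details)) instead of the hand-written list. The sketch deliberately ignores flag runs that are followed by combining marks in the same cluster, so treat it as a stopgap rather than a full implementation of the boundary rules.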

user2357112 supports Monica answered Nov 01 '25 01:11
