How can I properly parse a string into grapheme clusters using Python 2.7?

Question

I'm trying to find a way to split a string into grapheme clusters using Python 2.7.1. For example, given the string:

details = u"Hello 🇦🇹🇻🇪"

I believe it should be parsed as:

[u"H", u"e", u"l", u"l", u"o", u" ", u"\U0001f1e6\U0001f1f9", u"\U0001f1fb\U0001f1ea"]

I was using grapheme_clusters from the uniseg library, but it groups both flags into a single cluster:

[u"H", u"e", u"l", u"l", u"o", u" ", u"\U0001f1e6\U0001f1f9\U0001f1fb\U0001f1ea"]

I have a hard requirement to use Python 2.7.1. I know Unicode support is better in Python 3.x.

Is my interpretation of how this string should be parsed correct?
Is there a way to do this with Python 2.7?
Is this actually any easier in Python 3.x?

user2357112 supports Monica · Accepted Answer

The behavior you're seeing from uniseg used to be correct, but the Unicode rules have since changed.

As of uniseg version 0.7.1 (current as of this post), the uniseg documentation refers to an outdated version of the Unicode grapheme cluster boundary rules, given in Unicode Standard Annex #29 revision 21. That revision includes the rule

Do not break between regional indicator symbols.

whereas the most recent version of the Unicode grapheme cluster boundary rules, given in Unicode Standard Annex #29 revision 29, says

Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.

You could file a bug report with uniseg, or look for a different library with a more up-to-date implementation. The uniseg Bitbucket page links to a few alternatives, such as PyICU.
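
In the meantime, one possible workaround is to post-process the clusters you already get. The following is a minimal sketch, not part of uniseg: it assumes the library returns a whole run of regional indicator symbols as one cluster (as in the output shown in the question) and splits such a run into pairs, leaving a trailing odd symbol on its own, which is what the newer rule implies. The name split_flag_clusters is hypothetical.

```python
import re

# One regional indicator (RI) symbol, U+1F1E6..U+1F1FF. Narrow Python 2
# builds store non-BMP characters as surrogate pairs, so match those instead.
if len(u"\U0001F1E6") == 1:
    _RI = u"[\U0001F1E6-\U0001F1FF]"
else:
    _RI = u"\ud83c[\udde6-\uddff]"

# A run of RI symbols is consumed two at a time; a leftover odd one stands alone.
_RI_PAIR = re.compile(u"(?:%s){1,2}" % _RI)
# Matches a cluster that consists of RI symbols and nothing else.
_RI_RUN = re.compile(u"(?:%s)+\\Z" % _RI)


def split_flag_clusters(clusters):
    """Split clusters made up only of RI symbols into flag-sized pairs.

    Hypothetical helper: `clusters` is any iterable of unicode strings,
    for example the output of an older grapheme cluster implementation.
    """
    result = []
    for cluster in clusters:
        if _RI_RUN.match(cluster):
            result.extend(_RI_PAIR.findall(cluster))
        else:
            result.append(cluster)
    return result
```

Passing the list from the question through split_flag_clusters should turn the single four-symbol cluster into the two flag clusters you expected. This only patches the regional indicator rule; any other rule that changed between the two revisions would still need an updated library.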

Tags:

python

python-3.x

unicode

python-2.7

Eric Conner
