If I have a Python Unicode string that contains combining characters, <code>len</code> reports a value that does not correspond to the number of characters "seen". For example, for a string with combining overlines and underlines such as <code>u'A\u0332\u0305BC'</code>, <code>len(u'A\u0332\u0305BC')</code> reports 5, but the displayed string is only 3 characters long. How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?

How do I get the "visible" length of a combining Unicode string in Python?

3 Answers

The unicodedata module has a function, combining, that returns the canonical combining class of a single character, which tells you whether it is a combining character. If it returns 0, you can count the character as non-combining.

import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))

or, slightly simpler:

sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
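
To see why the count comes out as 3, it can help to list each code point next to its combining class. The following is a minimal sketch using only the standard library and written for Python 3; the helper name combining_classes is made up for illustration and is not part of unicodedata.

```python
import unicodedata

def combining_classes(text):
    """Return (character name, canonical combining class) for each code point."""
    return [(unicodedata.name(ch, u'<unnamed>'), unicodedata.combining(ch))
            for ch in text]

sample = u'A\u0332\u0305BC'
for name, ccc in combining_classes(sample):
    # A class of 0 marks a character that is counted; nonzero ones are skipped.
    print(name, ccc)

# Count only the characters whose combining class is 0.
visible = sum(1 for _, ccc in combining_classes(sample) if ccc == 0)
print(visible)
```

The two marks after the A report nonzero classes, so only A, B and C contribute to the total.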

answered Nov 05 '22 03:11

Mark Ransom

If your regex flavor supports matching grapheme clusters, you can use \X.

While the default Python re module does not support \X, Matthew Barnett's regex module does:

>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3

On Python 2, the pattern itself must be a Unicode literal (u prefix):

>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
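
The same idea can be wrapped in a small helper. This is only a sketch: it assumes the third-party regex package is installed (it is not part of the standard library), and the function name grapheme_clusters is hypothetical. The import is guarded so that the snippet still loads where regex is missing.

```python
try:
    import regex  # third-party package (pip install regex), not the stdlib re
except ImportError:
    regex = None

def grapheme_clusters(text):
    """Split text into grapheme clusters using the regex module's \\X."""
    if regex is None:
        raise RuntimeError('the third-party "regex" package is not installed')
    return regex.findall(r'\X', text)

if regex is not None:
    clusters = grapheme_clusters(u'A\u0332\u0305BC')
    print(clusters)       # each base character together with its marks
    print(len(clusters))  # number of grapheme clusters
```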

answered Nov 05 '22 04:11

dawg

Combining characters are not the only zero-width characters:

>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1

("\u200c", written "&zwnj;" in HTML, is the zero-width non-joiner, a non-printing character.)

In this case the regex module does not work either:

>>> len(regex.findall(r'\X', u'\u200c'))
1

I found wcwidth that handles the above case correctly:

>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0

But it still doesn't seem to work with user 596219's example:

>>> wcswidth('각')
4
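
If installing wcwidth is not an option, a rough standard-library approximation is to skip characters by general category instead of by combining class. The sketch below is a heuristic under a stated assumption, namely that nonspacing marks (Mn), enclosing marks (Me) and format characters (Cf) occupy no position; the name approx_visible_length is hypothetical.

```python
import unicodedata

# Assumption: treat nonspacing marks (Mn), enclosing marks (Me) and format
# characters (Cf) as occupying no position. This is a heuristic, not a rule.
ZERO_WIDTH_CATEGORIES = frozenset(['Mn', 'Me', 'Cf'])

def approx_visible_length(text):
    """Count code points whose general category is not treated as zero-width."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ZERO_WIDTH_CATEGORIES)

print(approx_visible_length(u'A\u0332\u0305BC'))  # marks are Mn, so 3
print(approx_visible_length(u'\u200c'))           # ZWNJ is Cf, so 0
```

This covers the zero-width non-joiner case above, but it makes no attempt to handle double-width East Asian characters, conjoining Hangul jamo, or the few format characters that are actually visible, so it is likely to be less accurate than wcwidth.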

answered Nov 05 '22 04:11

AXO

Tags:

python

unicode

python-2.7

orome
