
How do I get the "visible" length of a combining Unicode string in Python?

If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".

For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5; but the displayed string is only 3 characters long.

How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?

orome asked Oct 26 '15 17:10

3 Answers

The unicodedata module has a function, combining, that returns the canonical combining class of a single character. If it returns 0, you can count the character as non-combining.

import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))

or, slightly simpler:

sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
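Wrapped as a reusable function, the same idea might look like the sketch below. The name visible_length is made up for illustration, and the code assumes Python 3, where every str is Unicode. It only discounts characters with a nonzero combining class, so other zero-width characters are still counted.

```python
import unicodedata


def visible_length(text):
    """Count the code points whose canonical combining class is 0.

    Hypothetical helper: combining marks with a nonzero class are skipped,
    everything else (including other zero-width characters) is counted.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# A + combining low line + combining overline, then B and C
print(visible_length('A\u0332\u0305BC'))  # 3
```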
Mark Ransom answered Nov 05 '22 03:11



If you have a regex flavor that supports matching grapheme clusters, you can use \X.

While the default Python re module does not support \X, Matthew Barnett's regex module does:

>>> import regex
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3

On Python 2, the pattern itself needs to be a Unicode string, so use the u prefix:

>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
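If installing the third-party regex module is not an option, a rough standard-library approximation is to attach each combining mark to the character before it. This is only a sketch: the function name combining_clusters is hypothetical, and it is not how regex implements grapheme matching. Full grapheme cluster rules cover more cases (for example Hangul jamo and emoji sequences) than this handles.

```python
import unicodedata


def combining_clusters(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: a character with a nonzero canonical combining class
    is appended to the previous cluster; anything else starts a new cluster.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


clusters = combining_clusters('A\u0332\u0305BC')
print(len(clusters))  # 3
```

For the example string this yields the same three groups as the findall call above: the A with its two marks, then B, then C.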
dawg answered Nov 05 '22 04:11



Combining characters are not the only zero-width characters:

>>> import unicodedata
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1

(u'\u200c' is the zero-width non-joiner, a non-printing character.)

In this case the regex module does not work either:

>>> import regex
>>> len(regex.findall(r'\X', u'\u200c'))
1

I found the wcwidth package, which handles the above case correctly:

>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0

But it still doesn't seem to work with user 596219's example:

>>> wcswidth('각')
4
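For the zero-width cases shown earlier in this answer, a dependency-free approximation is to skip characters by Unicode general category instead of by combining class. The sketch below is an assumption-laden illustration, not what wcwidth does: the function name and the chosen category set (Mn, Me, Cf) are my own, and it counts code points rather than terminal columns, so double-width characters still count as one.

```python
import unicodedata

# Categories assumed to occupy no position of their own:
# Mn = nonspacing mark, Me = enclosing mark, Cf = format character.
ZERO_WIDTH_CATEGORIES = {'Mn', 'Me', 'Cf'}


def approx_visible_length(text):
    """Count code points whose general category is not in the zero-width set."""
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in ZERO_WIDTH_CATEGORIES
    )


print(approx_visible_length('A\u0332\u0305BC'))  # 3
print(approx_visible_length('\u200c'))           # 0
```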
AXO answered Nov 05 '22 04:11
