How do I count letters in a string?

Say I have a Unicode str, for example

my_str = "नमस्ते"  # ['न', 'म', 'स', '्', 'त', 'े']

How do I find how many letters it contains? len(my_str) returns 6, which is the number of Unicode code points, but the word is actually 4 letters long.

Bonus question: some languages treat a digraph as a single letter (for example, "Dh" is the 6th letter of the modern Albanian alphabet). How can I handle that edge case?

242

asked Aug 22 '19 22:08

Boris

1 Answer

You want to segment text into user-perceived characters. In Unicode, text segmentation is specified by UAX #29 (Unicode Text Segmentation).

"4 letters long"

That terminology is too narrow. The precise statement is "4 grapheme clusters long".
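To see where the 6 comes from, here is a minimal standard-library sketch (it does not use uniseg) that lists each code point of the example string with its Unicode general category and name. The variable name categories is only illustrative.

```python
import unicodedata

my_str = "\u0928\u092e\u0938\u094d\u0924\u0947"  # नमस्ते

# len() counts code points; the UTF-8 encoding is longer still.
print(len(my_str))                  # 6
print(len(my_str.encode("utf-8")))  # 18

# Category 'Lo' is a letter; 'Mn' is a nonspacing mark that attaches
# to the preceding character instead of standing on its own.
for ch in my_str:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

categories = [unicodedata.category(ch) for ch in my_str]
print(categories)  # ['Lo', 'Lo', 'Lo', 'Mn', 'Lo', 'Mn']
```

Counting only the non-mark code points happens to give 4 here, but that shortcut is not a general rule (it fails for Hangul jamo sequences and emoji sequences, for instance), which is why a UAX #29 implementation is the safer choice. The exact cluster boundaries can also depend on the Unicode version a library implements.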

Use the uniseg library:

from uniseg.graphemecluster import grapheme_clusters
for text in ('नमस्ते', 'Bo\u0304ris', 'Ꙝ̛͖͋҉ᄀᄀᄀ각ᆨᆨ'):
    print(list(grapheme_clusters(text)))
#['न', 'म', 'स्', 'ते']
#['B', 'ō', 'r', 'i', 's']
#['Ꙝ̛͋', 'ᄀᄀᄀ각ᆨᆨ']

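# Tailor the default rules: treat the digraph 'dh' as a single cluster.
# A tailor function receives the string and the default break flags
# (1 = a break is allowed before s[i], 0 = no break) and yields adjusted flags.
def albanian_digraph_dh(s, breakables):
    for i, breakable in enumerate(breakables):
        # Suppress the break between a 'd' ending at i and an 'h' starting at i.
        if s.endswith('d', 0, i) and s.startswith('h', i):
            yield 0
        else:
            yield breakable

# The same idea extended to every Albanian digraph, in title case and lower case
# (all-caps forms such as 'DH' are not covered).
ALBANIAN_DIGRAPHS = {"Dh", "Gj", "Ll", "Nj", "Rr", "Sh", "Th", "Xh", "Zh"}
ALBANIAN_DIGRAPHS |= {digraph.lower() for digraph in ALBANIAN_DIGRAPHS}
def albanian_digraphs(s, breakables):
    for i, breakable in enumerate(breakables):
        yield 0 if s[i-1:i+1] in ALBANIAN_DIGRAPHS else breakable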

# from https://sq.wiktionary.org/wiki/Speciale:PrefixIndex?prefix=dh
for text in ('dhallanik', 'dhelpëror', 'dhembshurisht', 'dhevështrues', 'dhimbshëm', 'dhjamosje', 'dhjetëballësh', 'dhjetëminutësh', 'dhogaç', 'dhogiç', 'dhomë-muze', 'dhuratë', 'dhëmbinxhi', 'dhëmbçoj', 'dhëmbëkatarosh'):
    print(list(grapheme_clusters(text, albanian_digraphs)))

#['dh', 'a', 'll', 'a', 'n', 'i', 'k']
#['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']
#['dh', 'e', 'm', 'b', 'sh', 'u', 'r', 'i', 'sh', 't']
#['dh', 'e', 'v', 'ë', 'sh', 't', 'r', 'u', 'e', 's']
#['dh', 'i', 'm', 'b', 'sh', 'ë', 'm']
#['dh', 'j', 'a', 'm', 'o', 's', 'j', 'e']
#['dh', 'j', 'e', 't', 'ë', 'b', 'a', 'll', 'ë', 'sh']
#['dh', 'j', 'e', 't', 'ë', 'm', 'i', 'n', 'u', 't', 'ë', 'sh']
#['dh', 'o', 'g', 'a', 'ç']
#['dh', 'o', 'g', 'i', 'ç']
#['dh', 'o', 'm', 'ë', '-', 'm', 'u', 'z', 'e']
#['dh', 'u', 'r', 'a', 't', 'ë']
#['dh', 'ë', 'm', 'b', 'i', 'n', 'xh', 'i']
#['dh', 'ë', 'm', 'b', 'ç', 'o', 'j']
#['dh', 'ë', 'm', 'b', 'ë', 'k', 'a', 't', 'a', 'r', 'o', 'sh']

You can install uniseg with:

pip install uniseg
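If adding a dependency is not an option and only the digraph case matters, a rough standard-library sketch follows. It is not the uniseg tailoring shown earlier: it normalizes to NFC and then greedily matches digraphs with a regular expression, so it only handles accents that have a precomposed form, and it cannot tell a true digraph from two letters that merely happen to be adjacent. The names albanian_letters and _LETTER are made up for illustration.

```python
import re
import unicodedata

# Albanian digraphs, written once; matching below is case-insensitive.
ALBANIAN_DIGRAPHS = ("Dh", "Gj", "Ll", "Nj", "Rr", "Sh", "Th", "Xh", "Zh")

# Try every digraph first, then fall back to any single character.
_LETTER = re.compile("|".join(ALBANIAN_DIGRAPHS) + "|.", re.IGNORECASE | re.DOTALL)

def albanian_letters(word):
    """Split a word into Albanian letters, keeping digraphs together."""
    # NFC composes 'e' + combining diaeresis into the single code point 'ë'.
    return _LETTER.findall(unicodedata.normalize("NFC", word))

print(albanian_letters("dhallanik"))        # ['dh', 'a', 'll', 'a', 'n', 'i', 'k']
print(len(albanian_letters("dhëmbinxhi")))  # 8
```

Because the digraph alternatives come before the single-character fallback, the scan always prefers a digraph when one starts at the current position.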

142

answered Sep 28 '22 06:09

daxim

Tags: python, unicode, python-unicode, unicode-string
