How do I count letters in a string?

Say I have a Python str containing non-ASCII text, for example

my_str = "नमस्ते"  # ['न', 'म', 'स', '्', 'त', 'े']

how do I find how many letters it contains? len(my_str) returns 6, which is how many Unicode code points it contains. It's actually 4 letters long.

Bonus question: some languages treat a digraph as a single letter (for example, "Dh" is the 6th letter of the modern Albanian alphabet). How can I handle that edge case?

asked Aug 22 '19 22:08 by Boris

1 Answer

You want to segment text. In Unicode, text segmentation is specified by UAX #29 (Unicode Text Segmentation).

You describe the string as "4 letters long". That terminology is too narrow; the precise statement is that it is "4 grapheme clusters long".
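
To make the difference between code points and grapheme clusters concrete, here is a rough standard-library sketch. It is only an approximation, not an implementation of UAX #29: it simply attaches every combining mark to the character before it. The helper name `approx_grapheme_clusters` is made up for this illustration.

```python
import unicodedata

def approx_grapheme_clusters(text):
    """Group each base character with the combining marks that follow it.

    Rough approximation only; it ignores most UAX #29 rules
    (Hangul syllables, emoji sequences, CRLF, and so on).
    """
    clusters = []
    for ch in text:
        # Mn = nonspacing mark, Mc = spacing combining mark, Me = enclosing mark
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

my_str = "नमस्ते"
print(len(my_str))                            # 6 code points
print(len(approx_grapheme_clusters(my_str)))  # 4 approximate clusters
```

For this sample the approximation happens to agree with the count given in the question, but for real text a proper segmentation library is the safer choice.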

Use the uniseg library:

from uniseg.graphemecluster import grapheme_clusters
for text in ('नमस्ते', 'Bo\u0304ris', 'Ꙝ̛͖͋҉ᄀᄀᄀ각ᆨᆨ'):
    print(list(grapheme_clusters(text)))
#['न', 'म', 'स्', 'ते']
#['B', 'ō', 'r', 'i', 's']
#['Ꙝ̛͋', 'ᄀᄀᄀ각ᆨᆨ']

# treat digraph 'dh' as a customised grapheme cluster:
# yielding 0 suppresses the break between 'd' and 'h'
def albanian_digraph_dh(s, breakables):
    for i, breakable in enumerate(breakables):
        if s.endswith('d', 0, i) and s.startswith('h', i):
            yield 0
        else:
            yield breakable

# you can do all the digraphs like this
ALBANIAN_DIGRAPHS = {"Dh", "Gj", "Ll", "Nj", "Rr", "Sh", "Th", "Xh", "Zh"}
ALBANIAN_DIGRAPHS |= {digraph.lower() for digraph in ALBANIAN_DIGRAPHS}
def albanian_digraphs(s, breakables):
    for i, breakable in enumerate(breakables):
        yield 0 if s[i-1:i+1] in ALBANIAN_DIGRAPHS else breakable

# from https://sq.wiktionary.org/wiki/Speciale:PrefixIndex?prefix=dh
for text in ('dhallanik', 'dhelpëror', 'dhembshurisht', 'dhevështrues', 'dhimbshëm', 'dhjamosje', 'dhjetëballësh', 'dhjetëminutësh', 'dhogaç', 'dhogiç', 'dhomë-muze', 'dhuratë', 'dhëmbinxhi', 'dhëmbçoj', 'dhëmbëkatarosh'):
    print(list(grapheme_clusters(text, albanian_digraphs)))

#['dh', 'a', 'll', 'a', 'n', 'i', 'k']
#['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']
#['dh', 'e', 'm', 'b', 'sh', 'u', 'r', 'i', 'sh', 't']
#['dh', 'e', 'v', 'ë', 'sh', 't', 'r', 'u', 'e', 's']
#['dh', 'i', 'm', 'b', 'sh', 'ë', 'm']
#['dh', 'j', 'a', 'm', 'o', 's', 'j', 'e']
#['dh', 'j', 'e', 't', 'ë', 'b', 'a', 'll', 'ë', 'sh']
#['dh', 'j', 'e', 't', 'ë', 'm', 'i', 'n', 'u', 't', 'ë', 'sh']
#['dh', 'o', 'g', 'a', 'ç']
#['dh', 'o', 'g', 'i', 'ç']
#['dh', 'o', 'm', 'ë', '-', 'm', 'u', 'z', 'e']
#['dh', 'u', 'r', 'a', 't', 'ë']
#['dh', 'ë', 'm', 'b', 'i', 'n', 'xh', 'i']
#['dh', 'ë', 'm', 'b', 'ç', 'o', 'j']
#['dh', 'ë', 'm', 'b', 'ë', 'k', 'a', 't', 'a', 'r', 'o', 'sh']

You can install it with:

pip install uniseg
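
If installing a package is not an option and the input is plain Albanian words, the digraph part alone can be sketched with the standard library. This is a swapped-in technique (a regular expression), not what uniseg does, and the names `DIGRAPHS`, `LETTER_RE` and `albanian_letters` are hypothetical. It assumes the text can be normalised to NFC so that letters such as "ë" are single code points, and it cannot tell a true digraph from two separate letters that merely happen to be adjacent.

```python
import re
import unicodedata

# Digraphs of the Albanian alphabet, as listed in the answer above.
DIGRAPHS = ("dh", "gj", "ll", "nj", "rr", "sh", "th", "xh", "zh")

# Try a digraph first, then fall back to any single character.
LETTER_RE = re.compile("|".join(DIGRAPHS) + "|.", re.IGNORECASE | re.DOTALL)

def albanian_letters(word):
    """Split a word into letters, counting each digraph as one letter."""
    # Assumption: NFC composes base letter + diacritic into one code point.
    return LETTER_RE.findall(unicodedata.normalize("NFC", word))

print(albanian_letters("dhallanik"))        # ['dh', 'a', 'll', 'a', 'n', 'i', 'k']
print(len(albanian_letters("dhëmbinxhi")))  # 8
```

Because the pattern is case-insensitive, "Dh" and "DH" are also kept together as one letter.
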
answered Sep 28 '22 06:09 by daxim