Say I have a Unicode str, for example:
my_str = "नमस्ते" # ['न', 'म', 'स', '्', 'त', 'े']
How do I find how many letters it contains? len(my_str) returns 6, which is the number of Unicode code points it contains. It's actually 4 letters long.
Bonus question: some languages treat a digraph as a single letter (for example, "Dh" is the 6th letter of the modern Albanian alphabet). How can I handle that edge case?
You want to segment text. In Unicode, this is governed by UAX #29 (Unicode Text Segmentation).
"4 letters long"
That terminology is too narrow. It should say "4 grapheme clusters long", because a single user-perceived character can span several code points.
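To see where the 6 versus 4 discrepancy comes from, here is a minimal standard-library sketch. The helper names describe and approx_cluster_count are made up for illustration. Skipping combining marks is only a rough approximation that happens to agree with the grapheme cluster count for these two strings; it is not an implementation of UAX #29.

```python
import unicodedata

# "नमस्ते" written with escapes so the individual code points are explicit.
my_str = "\u0928\u092e\u0938\u094d\u0924\u0947"

def describe(text):
    """Return (code point, general category, name) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))
        for ch in text
    ]

def approx_cluster_count(text):
    """Rough estimate only: count code points that are not combining marks.

    Assumption: every mark (general category M*) attaches to the preceding
    base character. Real segmentation has more rules than this.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

for row in describe(my_str):
    print(row)

print(len(my_str))                   # number of code points
print(approx_cluster_count(my_str))  # code points that are not combining marks
```

The virama (U+094D) and the vowel sign (U+0947) are both combining marks, which is why len() counts two more items than a reader sees.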
Use the uniseg library:
from uniseg.graphemecluster import grapheme_clusters

for text in ('नमस्ते', 'Bo\u0304ris', 'Ꙝ̛͖͋҉ᄀᄀᄀ각ᆨᆨ'):
    print(list(grapheme_clusters(text)))
#['न', 'म', 'स्', 'ते']
#['B', 'ō', 'r', 'i', 's']
#['Ꙝ̛͋', 'ᄀᄀᄀ각ᆨᆨ']

# treat digraph 'dh' as a customised grapheme cluster
def albanian_digraph_dh(s, breakables):
    for i, breakable in enumerate(breakables):
        # suppress the break between a 'd' and the 'h' that follows it
        if s.endswith('d', 0, i) and s.startswith('h', i):
            yield 0
        else:
            yield breakable

# you can do all the digraphs like this
ALBANIAN_DIGRAPHS = {"Dh", "Gj", "Ll", "Nj", "Rr", "Sh", "Th", "Xh", "Zh"}
ALBANIAN_DIGRAPHS |= {digraph.lower() for digraph in ALBANIAN_DIGRAPHS}

def albanian_digraphs(s, breakables):
    for i, breakable in enumerate(breakables):
        yield 0 if s[i-1:i+1] in ALBANIAN_DIGRAPHS else breakable

# from https://sq.wiktionary.org/wiki/Speciale:PrefixIndex?prefix=dh
for text in ('dhallanik', 'dhelpëror', 'dhembshurisht', 'dhevështrues', 'dhimbshëm', 'dhjamosje', 'dhjetëballësh', 'dhjetëminutësh', 'dhogaç', 'dhogiç', 'dhomë-muze', 'dhuratë', 'dhëmbinxhi', 'dhëmbçoj', 'dhëmbëkatarosh'):
    print(list(grapheme_clusters(text, albanian_digraphs)))
#['dh', 'a', 'll', 'a', 'n', 'i', 'k']
#['dh', 'e', 'l', 'p', 'ë', 'r', 'o', 'r']
#['dh', 'e', 'm', 'b', 'sh', 'u', 'r', 'i', 'sh', 't']
#['dh', 'e', 'v', 'ë', 'sh', 't', 'r', 'u', 'e', 's']
#['dh', 'i', 'm', 'b', 'sh', 'ë', 'm']
#['dh', 'j', 'a', 'm', 'o', 's', 'j', 'e']
#['dh', 'j', 'e', 't', 'ë', 'b', 'a', 'll', 'ë', 'sh']
#['dh', 'j', 'e', 't', 'ë', 'm', 'i', 'n', 'u', 't', 'ë', 'sh']
#['dh', 'o', 'g', 'a', 'ç']
#['dh', 'o', 'g', 'i', 'ç']
#['dh', 'o', 'm', 'ë', '-', 'm', 'u', 'z', 'e']
#['dh', 'u', 'r', 'a', 't', 'ë']
#['dh', 'ë', 'm', 'b', 'i', 'n', 'xh', 'i']
#['dh', 'ë', 'm', 'b', 'ç', 'o', 'j']
#['dh', 'ë', 'm', 'b', 'ë', 'k', 'a', 't', 'a', 'r', 'o', 'sh']
You can install uniseg with:
pip install uniseg
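If installing a third-party package is not an option, the digraph part alone can be approximated with the standard library. This is a sketch under stated assumptions, not a replacement for proper segmentation: the names DIGRAPHS, LETTER_RE and albanian_letters are hypothetical, the input is normalized to NFC so that "ë" is a single code point, and every listed letter pair is assumed to be a digraph (the same assumption the tailoring above makes).

```python
import re
import unicodedata

# Lowercase forms of the nine Albanian digraphs; matching is case-insensitive.
DIGRAPHS = ("dh", "gj", "ll", "nj", "rr", "sh", "th", "xh", "zh")

# Try the digraphs first, then fall back to any single code point.
LETTER_RE = re.compile("|".join(DIGRAPHS) + "|.", re.IGNORECASE | re.DOTALL)

def albanian_letters(word):
    """Split a word into letters, keeping each digraph as one unit.

    Assumption: after NFC normalization every remaining letter is a single
    code point, which holds for the Albanian alphabet but not in general.
    """
    return LETTER_RE.findall(unicodedata.normalize("NFC", word))

print(albanian_letters("dhallanik"))
print(len(albanian_letters("dhallanik")))
```

Because the regular expression lists the digraphs before the single-character fallback, "dh" and "ll" are consumed as one match each. Non-letters such as the hyphen in "dhomë-muze" are returned as their own items, so filter them out if only letters should be counted.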