I want to find the frequency of different letters in a text, and some of them use diacritics. For example, the text uses both 'å' and 'ą̊' (U+00E5 U+0328), and their frequencies need to be counted separately.
How do I do that?
I've tried using the Counter collection, opening the file as UTF-8, and splitting the text string using both text.split() and list(text), but Python still counts 'å' and 'ą̊' as the same letter!
Diacritics are marks placed above or below (or sometimes next to) a letter in a word to indicate a particular pronunciation—in regard to accent, tone, or stress—as well as meaning, especially when a homograph exists without the marked letter or letters.
Diacritics, often loosely called 'accents', are the various little dots and squiggles which, in many languages, are written above, below or on top of certain letters of the alphabet to indicate something about their pronunciation.
A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek διακριτικός (diakritikós, "distinguishing"), from διακρίνω (diakrī́nō, "to distinguish").
The problem here is that Unicode text (forget about UTF-8, I am talking about after decoding your data to proper Python 3 strings) uses more than one Unicode code point for some characters. 'ą̊', for example, has two marks, so while both "ą" and "å" can each exist as a single code point after proper normalization, a character that takes both marks has to use one of the "combining mark" characters in Unicode.
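To make that concrete, here is a small sketch using the sequence from the question (U+00E5 followed by U+0328). The variable names are only illustrative. It prints the code points under the two standard normalization forms: NFD splits everything into a base letter plus combining marks, while NFC composes as far as Unicode has precomposed characters for.

```python
import unicodedata

# "å" (U+00E5) followed by COMBINING OGONEK (U+0328), as in the question.
s = "\u00e5\u0328"

nfd = unicodedata.normalize("NFD", s)  # fully decomposed
nfc = unicodedata.normalize("NFC", s)  # composed as far as possible

print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0061', 'U+0328', 'U+030A']
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+0105', 'U+030A']

# A letter with a single mark fits in one code point after NFC:
print(len(unicodedata.normalize("NFC", "a\u030a")))  # 1
```

Even in NFC the doubly marked letter still needs two code points, which is why iterating over the string one code point at a time cannot count it as one letter.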
That means that Python's Counter alone won't be able to handle it without at least one extra step. In Python code, the way to find out about these mark characters is by using unicodedata.category, and it is not that friendly: it just returns a two-character identifier for the category.
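As a quick illustration of what unicodedata.category returns, the sketch below inspects the three code points of the decomposed letter. The helper name is_mark is hypothetical, not part of any library; it just checks whether the category starts with "M" (the mark categories are "Mn", "Mc" and "Me").

```python
import unicodedata

def is_mark(ch):
    """Return True if ch belongs to one of the Unicode mark categories."""
    return unicodedata.category(ch).startswith("M")

# Base letter, combining ogonek, combining ring above.
for ch in "a\u0328\u030a":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
# U+0061 Ll LATIN SMALL LETTER A
# U+0328 Mn COMBINING OGONEK
# U+030A Mn COMBINING RING ABOVE
```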
So, I think one thing that can be done is pre-process your text into a list where each character and its markings are normalized, using some "pure Python" code. Then, Counter could do its job.
It could be something along these lines:
import unicodedata
from collections import Counter

characters = []
text = ...
# Decompose all characters into plain letters + combining diacritics:
text = unicodedata.normalize("NFD", text)
for character in text:
    if unicodedata.category(character)[0] == "M":
        # character is a combining mark, so aggregate it with
        # the previous character
        characters[-1] += character
    else:
        characters.append(character)
counting = Counter(characters)
(Note that the snippet above does not take into account potentially malformed text that starts with a combining mark at position 0.)
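For completeness, here is one possible way to wrap the same idea in functions that also tolerate a leading combining mark. This is only a sketch: split_clusters and count_letters are hypothetical names, keeping a stray leading mark as its own entry is an assumption about what you would want, and the grouping is a simplification rather than full Unicode grapheme cluster segmentation.

```python
import unicodedata
from collections import Counter

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M") and clusters:
            # Combining mark: attach it to the previous base character.
            clusters[-1] += ch
        else:
            # Base character, or a stray mark at position 0 (kept as-is).
            clusters.append(ch)
    return clusters

def count_letters(text):
    """Count clusters whose base character is a letter."""
    return Counter(
        # Recompose so keys use precomposed characters where they exist.
        unicodedata.normalize("NFC", cluster)
        for cluster in split_clusters(text)
        if unicodedata.category(cluster[0]).startswith("L")
    )

sample = "\u00e5\u00e5 \u00e5\u0328"  # "åå " followed by "å" + ogonek
counts = count_letters(sample)
print(counts["\u00e5"])        # 2
print(counts["\u0105\u030a"])  # 1
```

Filtering on the letter categories drops spaces, digits and punctuation; remove that condition if you want every character counted.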