
How to correctly count letters with diacritics in text?

I want to find the frequency of the different letters in a text, some of which carry diacritics. For example, the text uses both 'å' and 'ą̊' (U+00E5 U+0328), and their frequencies need to be counted separately.

How do I do that?

I've tried using the Counter collection, opening the file as UTF-8, and splitting the text string with both text.split() and list(text), but Python still counts 'å' and 'ą̊' as the same letter!

user11448 asked Oct 30 '17 22:10



1 Answer

The problem here is that Unicode text (forget about UTF-8; I am talking about your data after it has been decoded into proper Python 3 strings) uses more than one Unicode code point for some characters. 'ą̊', for example, has two marks: while both "ą" and "å" can exist as a single code point after proper normalization, a character that takes both marks has to use at least one of the "combining mark" characters in Unicode.
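As a small illustration of that point, the sketch below normalizes the sequence from the question (U+00E5 U+0328) and lists the resulting code points. The variable names are only for this example.

```python
import unicodedata

ring_only = "\u00e5"          # å as a single precomposed code point
both_marks = "\u00e5\u0328"   # å followed by COMBINING OGONEK, as in the question

# NFC composes as far as possible; NFD decomposes into base letter + marks.
nfc = unicodedata.normalize("NFC", both_marks)
nfd = unicodedata.normalize("NFD", both_marks)

print(len(unicodedata.normalize("NFC", ring_only)))  # 1
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+0105', 'U+030A']
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0061', 'U+0328', 'U+030A']
```

Even in the composed form the letter with both marks still needs two code points, which is why iterating over the string one code point at a time splits it.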

That means that Python's Counter alone won't be able to handle it without at least one extra step. In Python code, the way to find out about these combining mark characters is unicodedata.category. It is not that friendly: it just returns a two-character identifier for the category, and the identifiers for combining marks all start with "M".
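For a feel of what unicodedata.category returns, here is a short sketch that prints the category and the official name of a few code points relevant to the question. The categories dictionary is just a convenience for this example.

```python
import unicodedata

# Two letters and the two combining marks from the question.
samples = ("a", "\u00e5", "\u0328", "\u030a")
categories = {character: unicodedata.category(character) for character in samples}

for character, category in categories.items():
    print(f"U+{ord(character):04X}", category, unicodedata.name(character))
# U+0061 Ll LATIN SMALL LETTER A
# U+00E5 Ll LATIN SMALL LETTER A WITH RING ABOVE
# U+0328 Mn COMBINING OGONEK
# U+030A Mn COMBINING RING ABOVE
```

"Ll" is a lowercase letter and "Mn" is a nonspacing mark, so checking whether the first letter of the category is "M" is enough to spot combining marks.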

So, I think one thing that can be done is to pre-process your text into a list in which each character is normalized and kept together with its marks, using some "pure Python" code. Then Counter can do its job.

It could be something along these lines:

import unicodedata
from collections import Counter

characters = []

text = ...

# Decompose all characters into plain letters + marking diacritics:
text = unicodedata.normalize("NFD", text)
for character in text:
    if unicodedata.category(character)[0] == "M":
        # character is a combining mark, so aggregate it with
        # the previous character
        characters[-1] += character
    else:
        characters.append(character)

counting = Counter(characters)

(Note that the snippet above does not account for malformed text that starts with a combining mark at position 0; in that case characters[-1] raises an IndexError.)
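One possible way to cover the malformed case mentioned in the note is sketched below. The function name count_clusters is hypothetical, and recomposing each key with NFC is an extra step not in the snippet above, added only so that keys such as 'å' come out as a single code point.

```python
import unicodedata
from collections import Counter


def count_clusters(text):
    """Count each base character together with its combining marks."""
    clusters = []
    for character in unicodedata.normalize("NFD", text):
        is_mark = unicodedata.category(character).startswith("M")
        if is_mark and clusters:
            # Attach the mark to the character it modifies.
            clusters[-1] += character
        else:
            # A base character, or a stray mark at the very start of the text.
            clusters.append(character)
    # Recompose each cluster so keys use precomposed forms where they exist.
    return Counter(unicodedata.normalize("NFC", cluster) for cluster in clusters)


counts = count_clusters("\u00e5 \u00e5\u0328 a\u030a")
print(counts["\u00e5"])        # 2  (precomposed å and a + ring are counted together)
print(counts["\u0105\u030a"])  # 1  (the letter with both ogonek and ring)
```

A mark at position 0 simply becomes its own entry instead of raising an error. Spaces and punctuation are still counted, so filter them out first if only letters matter.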

jsbueno answered Sep 29 '22 06:09