How to correctly count letters with diacritics in text?

I want to find the frequency of the different letters in a text, and some of them use diacritics. For example, the text uses both 'å' and 'ą̊' (U+00E5 U+0328), and their frequencies need to be counted separately.

How do I do that?

I've tried using collections.Counter, opening the file as UTF-8, and splitting the text string with both text.split() and list(text), but Python still counts 'å' and 'ą̊' as the same letter!

673

asked Oct 30 '17 22:10

user11448

1 Answer

The problem here is that Unicode text (forget about UTF-8; I am talking about your data after it has been decoded into proper Python 3 strings) can use more than one code point for a single character. 'ą̊', for example, carries two marks. While both "ą" and "å" can exist as a single code point after proper normalization, a character that takes both marks has to use one of the "combining mark" characters in Unicode.
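To make that concrete, here is a small sketch that prints the code points of both characters under NFC (composed) and NFD (decomposed) normalization. The names `ring` and `both` are just illustrative, and the second string is built from the code points quoted in the question.

```python
import unicodedata

ring = "\u00e5"          # å as a single precomposed code point
both = "\u00e5\u0328"    # å followed by COMBINING OGONEK, as in the question


def code_points(s):
    """Return the code points of s as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in s]


for label, s in (("ring only", ring), ("ring + ogonek", both)):
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    print(label, "NFC:", code_points(nfc), "NFD:", code_points(nfd))
```

The ring-only letter collapses to one code point under NFC, but the letter with both marks still needs a combining character even in its most composed form.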

That means that Python's Counter alone won't be able to handle it without at least one extra step. In Python code, the way to find out about these mark characters is unicodedata.category. It is not that friendly: it just returns a two-character identifier for the category.
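As a quick illustration of what unicodedata.category returns, the sketch below prints the category and name of a few relevant code points. The mark categories all start with "M" (Mn, Mc and Me); the helper name `is_mark` is only illustrative.

```python
import unicodedata


def is_mark(ch):
    """True if ch belongs to one of the Unicode mark categories (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


for ch in ("a", "\u00e5", "\u0328", "\u030a"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch), is_mark(ch))
```

The two letters are reported as "Ll" (lowercase letter), while the ogonek and the ring above are reported as "Mn" (nonspacing mark).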

So, I think one thing that can be done is to preprocess your text, using some "pure Python" code, into a list where each entry is a character together with its marks, in normalized form. Then Counter can do its job.

It could be something along these lines:

import unicodedata
from collections import Counter

characters = []

text = ...

# Decompose all characters into plain letters + marking diacritics:
text = unicodedata.normalize("NFD", text)
for character in text:
    if unicodedata.category(character)[0] == "M":
        # character is a combining mark, so aggregate it with
        # the previous character
        characters[-1] += character
    else:
        characters.append(character)

counting = Counter(characters)

(Note that the snippet above does not account for malformed text that starts with a combining mark at position 0: in that case characters is still empty, so characters[-1] raises an IndexError.)
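For what it's worth, here is one way the same idea could be wrapped in a function that also tolerates a leading combining mark by keeping it as its own entry. This is only a sketch: the name `count_clusters`, the sample string and the choice to recompose the keys with NFC for readability are my own assumptions, not something the question asks for.

```python
import unicodedata
from collections import Counter


def count_clusters(text):
    """Count each base character together with its combining marks."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        # Attach a mark to the previous entry, unless there is none yet
        # (malformed text starting with a mark): then keep it on its own.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    # Recompose each cluster so the keys are as short as possible.
    return Counter(unicodedata.normalize("NFC", c) for c in clusters)


# Two precomposed å, one decomposed å, and one å with an extra ogonek.
sample = "\u00e5\u00e5 a\u030a \u00e5\u0328"
counts = count_clusters(sample)
print(counts)
```

With this sample, the three spellings of 'å' end up under a single key, and the letter carrying both marks is counted separately.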

126

answered Sep 29 '22 06:09

jsbueno

Tags: python, python-3.x, unicode
