
Handling count of characters with diacritics in R

I'm trying to count the characters in strings that contain characters with diacritics, but I can't get the right result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want to get is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be counted as characters on their own, even when more than one diacritic is stacked on a base character).

How can I get this kind of result?

Stefano asked Oct 20 '22 10:10


2 Answers

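Here is my solution. The idea is that the characters of phonetic alphabets have a Unicode representation, so their Unicode properties can be used to tell base characters from diacritics.

Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer, which is described as follows: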

Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.

After this I used nchar, but because the previous function splits the string into two substrings, I took the sum of the counts.

library(Unicode)
sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4

I believe this package can be very useful in such cases, but I am not an expert and I do not know whether my solution works for every problem that involves phonetic alphabets. More examples would help to test its validity.
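As a cross-check that does not rely on the Alphabetic property, here is a minimal base R sketch that deletes combining marks (the \p{M} class in PCRE) before counting. The helper name count_base_chars is made up for this illustration and is not part of the Unicode or qdap packages. Unlike the tokenizer sum, it also counts spaces, digits and punctuation, so the two approaches can disagree on strings that contain them.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# count the characters that remain after dropping every combining mark.
count_base_chars <- function(x) {
  # \p{M} matches any Unicode mark; perl = TRUE is needed for \p{...} classes
  stripped <- gsub("\\p{M}", "", x, perl = TRUE)
  nchar(stripped, type = "chars")
}

x <- "n\u0325ala"      # the question's string, with U+0325 written as an escape
nchar(x)               # counts the combining ring below as its own character
count_base_chars(x)    # counts only the base characters
```

Because every mark is removed before counting, several diacritics stacked on one base character should still count as a single character.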

It also works well on another example:

> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
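Note that the sum above counts only alphabetic characters, so the space in this string is not included. If the space should count too, one possible alternative (not part of the answer above) is to count grapheme clusters with the stringi package. The sketch below assumes stringi is installed and skips itself otherwise; the variable name n_clusters is only illustrative.

```r
# Assumes the stringi package is available; the guard skips the example if not.
if (requireNamespace("stringi", quietly = TRUE)) {
  x <- "e\u032f \u028a\u032f"   # same string as above, written with escapes
  # type = "character" asks ICU for user-perceived characters, so a base
  # letter and its combining marks are counted together as one unit.
  n_clusters <- stringi::stri_count_boundaries(x, type = "character")
  print(n_clusters)
}
```

Here the clusters are the two letters with their diacritics plus the space between them, so the result differs from the tokenizer sum by one.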

P.S. There is only one " in the code, but when I copy and paste it a second one appears. I do not know why this happens.

SabDeM answered Oct 22 '22 12:10


Here's a solution using the qdap package that I maintain:

x <- "n̥ala"

library(qdap)
character_count(x)
## [1] 4

Tyler Rinker answered Oct 22 '22 13:10