Handling count of characters with diacritics in R

Question

I'm trying to count the characters in strings that contain characters with diacritics, but I can't get the right result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want to get is 4, since n̥ should count as one character (i.e. diacritics shouldn't be counted as characters of their own, even when more than one diacritic is stacked on a base character).

How can I get this kind of result?

SabDeM · Accepted Answer

Here is my solution. The idea is that the characters of phonetic alphabets have a Unicode representation, so:

Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer, which its documentation describes as follows:

Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.

After this I used nchar, but because the previous function splits the string into two substrings, I wrapped the result in sum.

library(Unicode)
sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4
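
To see where the extra character comes from, the code points can be inspected with base R alone. This is only a sketch and is not part of the tokenizer approach above. It assumes the diacritic in the question is U+0325 (COMBINING RING BELOW), written here as an escape so the example does not depend on how the text was pasted.

```r
# Base-R sketch: look at the code points behind the string.
# Assumption: the diacritic is U+0325 (COMBINING RING BELOW).
x <- "n\u0325ala"

code_points <- utf8ToInt(x)       # one integer per Unicode code point
length(code_points)               # 5, the same figure nchar(x) reports
sprintf("U+%04X", code_points)    # "U+006E" "U+0325" "U+0061" "U+006C" "U+0061"
```

The second code point is the combining mark. It is a separate code point, which is why nchar counts it.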

I believe this package can be very useful in such cases, but I am not an expert and I do not know whether my solution works for every problem involving phonetic alphabets. Other examples would help to test its validity.

It works well

Here is another example:

> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
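
For comparison, here is a rough base-R sketch of a different technique: it deletes every combining mark (Unicode property \p{M}) with a PCRE regular expression and counts what is left. It is an alternative to the tokenizer, not the same method. The helper name count_base_chars is made up for illustration, and the escapes assume the diacritics in the two examples are U+0325 and U+032F.

```r
# Sketch of an alternative: strip combining marks, then count.
# count_base_chars is a hypothetical helper, not part of any package.
count_base_chars <- function(s) {
  # perl = TRUE switches to PCRE, which understands the \p{M} property
  stripped <- gsub("\\p{M}", "", s, perl = TRUE)
  nchar(stripped, type = "chars")
}

count_base_chars("n\u0325ala")             # 4
count_base_chars("e\u032f \u028a\u032f")   # 3: the space is still counted
```

Unlike the tokenizer, which splits on the blank and therefore gives 2 for the second string, this version keeps spaces and other non-alphabetic characters in the count, so pick whichever behaviour matches what you want to measure.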

P.S. There is only one " in the code, but when I copy and paste it a second one appears. I do not know why this happens.

Tyler Rinker · Answer

Here's a solution using the qdap package that I maintain:

x <- "n̥ala"

library(qdap)
character_count(x)
## [1] 4

Tags: r, character-encoding, unicode, nlp, linguistics

Stefano
