Removing Hebrew "niqqud" using R

Question

I have been struggling to remove niqqud (diacritical signs used to represent vowels or to distinguish between alternative pronunciations of letters of the Hebrew alphabet). For instance, I have this variable: sample1 <- "הֻסְמַק"

I cannot find an effective way to remove the signs below the letters.

I tried stringr, with str_replace_all(sample1, "[^[:alnum:]]", ""), and I also tried gsub('[:punct:]','',sample1).

No success... :-( Any ideas?

Wiktor Stribiżew · Accepted Answer

You can use the \p{M} Unicode category to match diacritics with Perl-like regex, and gsub all of them in one go like this:

sample1 <- "הֻסְמַק"
gsub("\\p{M}", "", sample1, perl = TRUE)

Result: [1] "הסמק"
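To check what was actually removed, it can help to look at the code points before and after the substitution. The following is only a sketch: the sample word is rebuilt with \u escapes (assumed here to be the same seven code points as the string in the question), and code_points is a hypothetical helper name, not part of base R.

```r
# The sample word written with escapes (assumed to match the question's string):
# four Hebrew letters, the first three each followed by one vowel point.
sample1 <- "\u05d4\u05bb\u05e1\u05b0\u05de\u05b7\u05e7"

# Hypothetical helper: list the code points of a single string in U+XXXX form.
code_points <- function(x) sprintf("U+%04X", utf8ToInt(x))

# Letters and combining marks are separate code points.
code_points(sample1)

# Strip every combining mark, then list what is left.
stripped <- gsub("\\p{M}", "", sample1, perl = TRUE)
code_points(stripped)
```

Comparing the two listings shows that the points U+05BB, U+05B0 and U+05B7 disappear while the four letters stay untouched.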

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

See more at Regular-Expressions.info, "Unicode Categories".
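Note that \p{M} matches combining marks from every script, so it also strips cantillation marks and accents on non-Hebrew text. If that is too broad, one possible variant is to list the Hebrew points explicitly. The sketch below assumes that "niqqud" means the Unicode Hebrew points U+05B0 to U+05BD, U+05BF, U+05C1, U+05C2, U+05C4, U+05C5 and U+05C7; that choice, and the names niqqud_class and strip_niqqud, are illustrative assumptions rather than part of the original answer.

```r
# Assumed set of Hebrew points; adjust the class if your definition of niqqud differs.
niqqud_class <- "[\u05b0-\u05bd\u05bf\u05c1\u05c2\u05c4\u05c5\u05c7]"

# Hypothetical helper: remove only the Hebrew points listed above.
strip_niqqud <- function(x) gsub(niqqud_class, "", x, perl = TRUE)

sample1 <- "\u05d4\u05bb\u05e1\u05b0\u05de\u05b7\u05e7"

# Hebrew word plus a Latin word whose final "e" carries a combining acute accent.
mixed <- paste(sample1, "cafe\u0301")

# Removes the Hebrew points but keeps the combining acute on the Latin word.
hebrew_only <- strip_niqqud(mixed)

# Removes every combining mark, including the acute accent.
all_marks <- gsub("\\p{M}", "", mixed, perl = TRUE)
```

For purely Hebrew input without cantillation the two approaches give the same result, so the shorter \p{M} pattern is usually enough.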

Tags: regex, text, r, unicode, hebrew

Dmitry Leykin
