I have been struggling to remove niqqud (diacritical signs used to represent vowels or to distinguish between alternative pronunciations of letters of the Hebrew alphabet). For instance, I have this variable: sample1 <- "הֻסְמַק"
And I cannot find an effective way to remove the signs below the letters.
I tried stringr, with str_replace_all(sample1, "[^[:alnum:]]", "")
I tried gsub('[:punct:]','',sample1)
No success... :-( Any ideas?
You can use the \p{M} Unicode category to match diacritics with a Perl-like regex, and gsub all of them in one go like this:
sample1 <- "הֻסְמַק"
gsub("\\p{M}", "", sample1, perl=TRUE)
Result: [1] "הסמק"
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
See more at Regular-Expressions.info, "Unicode Categories".
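As a minimal sketch of the same idea wrapped for reuse: the helper name strip_marks is hypothetical (not from any package), and the word is written with \u escapes only so the example does not depend on the file's encoding. Note that the niqqud are combining marks rather than punctuation, which is why a punctuation class does not match them, and that '[:punct:]' would in any case need to be written '[[:punct:]]' to be treated as a POSIX class.

```r
# Hypothetical helper: remove every Unicode combining mark (category M)
# from a character vector, using base R's PCRE engine.
strip_marks <- function(x) {
  gsub("\\p{M}", "", x, perl = TRUE)
}

# The word from the question, spelled out code point by code point:
# he + qubuts, samekh + sheva, mem + patah, qof.
pointed <- "\u05D4\u05BB\u05E1\u05B0\u05DE\u05B7\u05E7"

# The same word with the four base letters only.
expected <- "\u05D4\u05E1\u05DE\u05E7"

plain <- strip_marks(pointed)
print(identical(plain, expected))
```

Because \p{M} matches any combining mark, this also strips cantillation marks and, in decomposed text, accents on non-Hebrew letters. If that is too broad for your data, a narrower character class would be needed.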