I have the exact problem described here:
removing Hebrew "niqqud" using r
Have been struggling to remove niqqud ( diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet). I have for instance this variable: sample1 <- "הֻסְמַק"
And I cannot find an effective way to remove the signs below the letters.
But in my case I have to do this in JavaScript.
Based on the UTF-8 values table described here, I have tried this regex without success.
Just a slight problem with your regex. Try the following:
const input = "הֻסְמַק";
console.log(input)
console.log(input.replace(/[\u0591-\u05C7]/g, ''));
/*
$ node index.js
הֻסְמַק
הסמק
*/
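When a regex like this does not seem to match, it can help to look at the actual code points in the string. The following is only a sketch: the helper name toCodePoints is made up for illustration, and the sample is written with escapes on the assumption that it matches the word from the question (he, qubuts, samekh, sheva, mem, patah, qof).

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function toCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Assumed to be the same word as in the question, spelled out with escapes.
const sample = "\u05D4\u05BB\u05E1\u05B0\u05DE\u05B7\u05E7";

// Before stripping: letters and vowel points are separate code points.
console.log(toCodePoints(sample).join(" "));
// U+05D4 U+05BB U+05E1 U+05B0 U+05DE U+05B7 U+05E7

// After stripping with the range from the answer: only the letters remain.
const stripped = sample.replace(/[\u0591-\u05C7]/g, "");
console.log(toCodePoints(stripped).join(" "));
// U+05D4 U+05E1 U+05DE U+05E7
```

If the dump shows code points outside U+0591-U+05C7 where you expected marks, the range in the regex is the thing to adjust.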
nj_’s answer is great.
Just to add a bit (because I don’t have enough reputation points to comment directly): [\u0591-\u05C7] may be too broad a brush. See the relevant table here: https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet#Compact_table
Rows 059x and 05Ax are for t'amim (accents/cantillation marks). Niqud per se is in rows 05Bx and 05Cx.
And as Avraham commented, you can run into an issue if 2 words are joined by a makaf (05BE): removing it leaves you with run-on words.
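To see which of these rows a given string actually uses before choosing a regex, something like the following might help. It is a rough sketch: countHebrewMarks and its return shape are hypothetical, and the ranges are simply the ones quoted in this answer (so the "niqqud" bucket covers everything in U+05B0-U+05C7 other than the makaf).

```javascript
// Hypothetical helper: count code points per range described in this answer.
function countHebrewMarks(text) {
  const counts = { taamim: 0, niqqud: 0, makaf: 0, other: 0 };
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp >= 0x0591 && cp <= 0x05af) {
      counts.taamim += 1; // rows 059x and 05Ax: cantillation marks
    } else if (cp === 0x05be) {
      counts.makaf += 1; // the joiner that causes run-on words if removed
    } else if (cp >= 0x05b0 && cp <= 0x05c7) {
      counts.niqqud += 1; // rest of rows 05Bx and 05Cx
    } else {
      counts.other += 1; // letters, spaces, anything else
    }
  }
  return counts;
}

// Example words written with escapes so the code points are unambiguous.
// First: a pointed word carrying one cantillation mark (U+0596).
const withTaam = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u0596\u05D9\u05EA";
// Second: two words joined by a makaf (U+05BE).
const withMakaf = "\u05DB\u05B4\u05BC\u05D9\u05BE\u05D8\u05D5\u05B9\u05D1";

console.log(countHebrewMarks(withTaam));
// { taamim: 1, niqqud: 5, makaf: 0, other: 6 }
console.log(countHebrewMarks(withMakaf));
// { taamim: 0, niqqud: 3, makaf: 1, other: 5 }
```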
If you want to remove only t’amim but keep nikud, use /[\u0591-\u05AF]/g. If you want to avoid the issue raised by Avraham, you have 2 options: either keep the maqaf, or replace it with a dash:
// keep the original makafim
const input = "כִּי־טוֹב";
console.log(input);
console.log(input.replace(/[\u05B0-\u05BD\u05BF-\u05C7]/g, ""));
// replace makafim with dashes
console.log(input.replace(/\u05BE/g, "-").replace(/[\u05B0-\u05C7]/g, ""));
/*
$ node index.js
כִּי־טוֹב
כי־טוב
כי-טוב
*/
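If you need these variants in more than one place, they could be wrapped in a single function. This is just one possible arrangement: stripHebrewMarks and its option names (keepNiqqud, makaf) are invented for this sketch, and the ranges are the same ones used above.

```javascript
// Ranges from the answers above; names are hypothetical.
const TAAMIM = /[\u0591-\u05AF]/g; // cantillation marks
const NIQQUD_EXCEPT_MAKAF = /[\u05B0-\u05BD\u05BF-\u05C7]/g; // rows 05Bx/05Cx minus U+05BE
const MAKAF = /\u05BE/g;

// Always removes t'amim. Options:
//   keepNiqqud: true leaves the vowel points in place (default: false)
//   makaf: "keep" (default), "dash" (replace with "-"), or "remove"
function stripHebrewMarks(text, { keepNiqqud = false, makaf = "keep" } = {}) {
  let result = text.replace(TAAMIM, "");
  if (!keepNiqqud) {
    result = result.replace(NIQQUD_EXCEPT_MAKAF, "");
  }
  if (makaf === "dash") {
    result = result.replace(MAKAF, "-");
  } else if (makaf === "remove") {
    result = result.replace(MAKAF, "");
  }
  return result;
}

// Two words joined by a makaf, written with escapes.
const joined = "\u05DB\u05B4\u05BC\u05D9\u05BE\u05D8\u05D5\u05B9\u05D1";

console.log(stripHebrewMarks(joined)); // letters with the makaf kept
console.log(stripHebrewMarks(joined, { makaf: "dash" })); // makaf becomes "-"
console.log(stripHebrewMarks(joined, { keepNiqqud: true })); // unchanged here, no t'amim present
```

Defaulting makaf to "keep" is a deliberate choice in this sketch, so that the run-on-words problem only appears if the caller explicitly asks for "remove".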