
Remove niqqud from a string in JavaScript

I have the exact problem described here:

removing Hebrew "niqqud" using r

Have been struggling to remove niqqud ( diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet). I have for instance this variable: sample1 <- "הֻסְמַק"

And I cannot find an effective way to remove the signs below the letters.

But in my case I have to do this in JavaScript.

Based on the UTF-8 values table described here, I have tried this regex without success.

Dorad asked Sep 20 '25 12:09


2 Answers

Just a slight problem with your regex. Try the following:

const input = "הֻסְמַק";
console.log(input)
console.log(input.replace(/[\u0591-\u05C7]/g, ''));

/*
$ node index.js
הֻסְמַק
הסמק
*/
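One case the range-based replace above does not cover is text that uses precomposed Hebrew presentation forms (in the U+FB1D to U+FB4F range), where a letter and its point are stored as a single code point, for example U+FB31 (bet with dagesh). The following is a minimal sketch, assuming such input can occur in your data: it decomposes the text with the standard `String.prototype.normalize` before stripping. The function name `stripMarksNormalized` is hypothetical, not something from the answer.

```javascript
// Sketch only: assumes the input may contain precomposed presentation forms.
// stripMarksNormalized is a hypothetical helper name.
function stripMarksNormalized(text) {
  return text
    .normalize("NFD")                 // split precomposed letters into base + marks
    .replace(/[\u0591-\u05C7]/g, "")  // same range as in the answer above
    .normalize("NFC");                // recompose unrelated characters such as "é"
}

// U+FB31 (bet with dagesh) decomposes to U+05D1 + U+05BC; the dagesh is then removed.
console.log(stripMarksNormalized("\uFB31") === "\u05D1"); // true
```

Note that the final NFC step may alter non-Hebrew text that was originally stored in decomposed form; drop it if you need the rest of the string byte-for-byte unchanged.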
nj_ answered Sep 23 '25 01:09


nj_’s answer is great.

Just to add a bit (because I don’t have enough reputation points to comment directly) -

[\u0591-\u05C7] may be too broad a brush. See the relevant table here: https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet#Compact_table

Rows 059x and 05Ax are for t'amim (accents/cantillation marks). Niqqud per se is in rows 05Bx and 05Cx.
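To make the row split concrete, here is a small sketch that sorts a code point into one of those groups. The function name `classifyHebrewMark` and the label strings are hypothetical. It also singles out the four code points in rows 05Bx and 05Cx that are punctuation rather than points (maqaf U+05BE, paseq U+05C0, sof pasuq U+05C3, nun hafukha U+05C6), since a blanket range over those rows removes them too.

```javascript
// Sketch only: classifyHebrewMark and its labels are illustrative names.
const HEBREW_PUNCTUATION = [0x05be, 0x05c0, 0x05c3, 0x05c6];

function classifyHebrewMark(ch) {
  const cp = ch.codePointAt(0);
  if (cp >= 0x0591 && cp <= 0x05af) return "cantillation"; // rows 059x and 05Ax
  if (HEBREW_PUNCTUATION.includes(cp)) return "punctuation"; // maqaf, paseq, sof pasuq, nun hafukha
  if (cp >= 0x05b0 && cp <= 0x05c7) return "point";         // rest of rows 05Bx and 05Cx
  return "other";                                           // letters and everything else
}

console.log(classifyHebrewMark("\u05B0")); // "point" (sheva)
console.log(classifyHebrewMark("\u05BE")); // "punctuation" (maqaf)
```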

And as Avraham commented, you can run into an issue if two words are joined by a maqaf (05BE): removing it leaves you with run-on words.

If you want to remove only t'amim but keep niqqud, use /[\u0591-\u05AF]/g. If you want to avoid the issue raised by Avraham, you have two options: either keep the maqaf, or replace it with a dash:

// keep the original maqafim: strip 05B0-05C7 but skip 05BE
const input = "כִּי־טוֹב"
console.log(input);
console.log(input.replace(/[\u05B0-\u05BD\u05BF-\u05C7]/g, ""));

// replace maqafim with dashes first, then strip the whole 05B0-05C7 range
console.log(input.replace(/\u05BE/g, "-").replace(/[\u05B0-\u05C7]/g, ""));

/*
$ node index.js
כִּי־טוֹב
כי־טוב
כי-טוב
*/
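The two variants above can be folded into one helper. This is a sketch under stated assumptions, not a definitive implementation: the name `stripNiqqud` and the `maqaf` option with its values "keep", "dash" and "remove" are hypothetical, chosen only for illustration.

```javascript
// Sketch only: stripNiqqud and the `maqaf` option are hypothetical names.
// maqaf: "keep" (default) leaves U+05BE in place, "dash" turns it into "-",
// "remove" deletes it (which can produce run-on words).
function stripNiqqud(text, { maqaf = "keep" } = {}) {
  let result = text;
  if (maqaf === "dash") result = result.replace(/\u05BE/g, "-");
  else if (maqaf === "remove") result = result.replace(/\u05BE/g, "");
  // Same range as in the answer, skipping the maqaf itself. Like the answer's
  // regex, it also removes paseq (05C0), sof pasuq (05C3) and nun hafukha (05C6).
  return result.replace(/[\u05B0-\u05BD\u05BF-\u05C7]/g, "");
}

const sample = "\u05DB\u05B4\u05BC\u05D9\u05BE\u05D8\u05D5\u05B9\u05D1"; // the sample word from the answer
console.log(stripNiqqud(sample));                    // maqaf kept
console.log(stripNiqqud(sample, { maqaf: "dash" })); // maqaf replaced with "-"
```

Cantillation marks (0591-05AF) are left untouched by this helper; widen the range to start at \u0591 if you want those removed as well.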
zuchmir answered Sep 23 '25 01:09