
Detect the Unicode language of a string in JavaScript

I have a string that contains a few words. I want to find all the words that contain only Tamil Unicode characters. I am new to JavaScript.

In Go, I do the same thing like this:

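            // needs imports: strings, unicode, unicode/utf8
            tokens := strings.Split(stringContent, delim) // split based on delim, say space

            for _, token := range tokens { // like foreach
                r, l := utf8.DecodeRuneInString(token) // first rune and its length in bytes
                if l != 1 { // first rune is not a single-byte (ASCII) character
                    if unicode.Is(unicode.Tamil, r) {
                        // Tamil word (only the first rune is checked)
                    }
                }
            }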

I found that string.split() in JavaScript will give me the individual words based on the delimiter, but I cannot work out how to check whether a word is a Tamil word. Can someone help me achieve this in JavaScript?

asked Dec 26 '22 19:12 by Sankar


1 Answer

An easy way is to do a regular expression match for words whose characters fall in a Unicode range (for Tamil, U+0B80 to U+0BFF).

Hope this helps: http://kourge.net/projects/regexp-unicode-block

A sample you can start with:

"இந்தியா ASASAS எறத்தாழ ASSASAS குடியரசு ASWED SAASAS".match(/[\u0B80-\u0BFF]+/g);
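
Note that an unanchored match returns runs of Tamil characters, so a mixed token such as "தமிழ்abc" would still yield "தமிழ்". If the goal is to keep only the words made up entirely of Tamil characters, one possible approach is to split first and test each word against an anchored pattern. This is a minimal sketch, not a definitive implementation: the function names `tamilWords` and `tamilWordsByScript` are hypothetical, and the second variant assumes an engine that supports ES2018 Unicode property escapes.

```javascript
// Hypothetical helper: keeps only the tokens that consist entirely of
// code points in the Tamil block, U+0B80 to U+0BFF.
function tamilWords(text, delim = " ") {
  const tamilOnly = /^[\u0B80-\u0BFF]+$/; // anchored: the whole token must match
  return text.split(delim).filter((word) => tamilOnly.test(word));
}

// Variant that assumes ES2018 Unicode property escapes are available
// (the "u" flag is required for \p{...}).
function tamilWordsByScript(text, delim = " ") {
  const tamilOnly = /^\p{Script=Tamil}+$/u;
  return text.split(delim).filter((word) => tamilOnly.test(word));
}

const sample = "இந்தியா ASASAS எறத்தாழ ASSASAS குடியரசு ASWED SAASAS";
console.log(tamilWords(sample)); // the three Tamil words, in order
```

Because the pattern uses `+`, empty tokens produced by repeated delimiters are dropped as well.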
answered Dec 29 '22 08:12 by Diode