Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Javascript regular expression for searching word boundaries in Unicode string

Is there solution to find word boundaries in Japanese string (E.g.: "私はマーケットに行きました。") via JavaScript regular expressions("xregexp" JS library cab be used)?

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

like image 563
Andrei Avatar asked Dec 27 '22 12:12

Andrei


1 Answers

However, the actual problem of separating the Japanese sentence into words is more complicated than it appears, since words are not separated into spaces as is the case, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

  • 私 - watakushi
  • は - wa
  • マーケット - maaketto
  • に - ni
  • 行きました - ikimashita
  • 。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.

like image 70
Peter O. Avatar answered Dec 30 '22 10:12

Peter O.