How does Chrome decide what to highlight when you double-click Japanese text?

If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:

どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。

For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.

How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.

asked May 08 '20 05:05

polm23

2 Answers

So it turns out V8 has a non-standard, multi-language word segmenter, Intl.v8BreakIterator, and it handles Japanese.

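function tokenizeJA(text) {
  // Intl.v8BreakIterator is non-standard and only exists in V8 (Chrome, Node.js).
  const it = new Intl.v8BreakIterator(['ja-JP'], { type: 'word' })
  it.adoptText(text)
  const words = []

  // next() returns the offset of the next boundary, or -1 once none remain.
  let prev = it.first()
  let cur = it.next()

  while (cur !== -1) {
    words.push(text.substring(prev, cur))
    prev = cur
    cur = it.next()
  }

  return words
}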

console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]

I also made a jsfiddle that shows this.

The quality is not amazing but I'm surprised this is supported at all.
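
For code that should not depend on a V8-only API, the standardized Intl.Segmenter provides the same kind of word segmentation. The following is a minimal sketch, assuming a runtime that ships Intl.Segmenter with full ICU data (recent Chrome or Node.js). The function name segmentWordsJA is made up for this example, and the exact splits depend on the ICU version bundled with the runtime, so no expected output is shown.

```javascript
// Hypothetical helper: split Japanese text into segments using the
// standard Intl.Segmenter API instead of the V8-only break iterator.
function segmentWordsJA(text) {
  const segmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' })
  // segment() returns an iterable of { segment, index, input, isWordLike }.
  return Array.from(segmenter.segment(text), (s) => s.segment)
}

const sample = '何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'
console.log(segmentWordsJA(sample))
```

The segments are contiguous and cover the whole input, punctuation included, so joining them back together reproduces the original string.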

answered Oct 10 '22 05:10

polm23

Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."
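
The "which word did you click in" step can be approximated from JavaScript. This is only a sketch of the idea, not Chrome's actual selection code: it assumes Intl.Segmenter is available, and the helper name wordRangeAt and its return shape are invented for illustration.

```javascript
// Hypothetical helper: return the range that a double-click at `offset`
// (a UTF-16 index into `text`) would cover, according to Intl.Segmenter.
function wordRangeAt(text, offset) {
  const segmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' })
  // containing() returns the segment covering the given index,
  // or undefined when the index is out of bounds.
  const hit = segmenter.segment(text).containing(offset)
  if (hit === undefined) return null
  return { start: hit.index, end: hit.index + hit.segment.length, word: hit.segment }
}

const text = '何でも薄暗いじめじめした所'
console.log(wordRangeAt(text, text.indexOf('暗')))
```

Which range comes back depends on the ICU dictionary the runtime ships, so results can differ between browser and Node.js versions.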

Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.

And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
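
To make the dictionary idea concrete, here is a toy greedy longest-match segmenter. It is a deliberately simplified illustration and not ICU's algorithm: ICU's real break engine is more involved and is not reproduced here. The four-word dictionary and the name segmentWithDictionary are assumptions made up for this example.

```javascript
// Toy dictionary; ICU's real list is far larger, and these entries were
// picked only so the example below has something to match.
const toyDictionary = new Set(['薄暗い', '何でも', '所', 'で'])
const maxWordLength = Math.max(...Array.from(toyDictionary, (w) => w.length))

// Greedy longest match: at each position take the longest dictionary word
// that starts there, falling back to a single character if none matches.
function segmentWithDictionary(text) {
  const out = []
  let pos = 0
  while (pos < text.length) {
    let len = Math.min(maxWordLength, text.length - pos)
    while (len > 1 && !toyDictionary.has(text.substring(pos, pos + len))) len--
    out.push(text.substring(pos, pos + len))
    pos += len
  }
  return out
}

console.log(segmentWithDictionary('何でも薄暗い所で'))
// ["何でも", "薄暗い", "所", "で"]
```

With this approach a word missing from the dictionary falls apart into single characters, which is one plausible way to read the odd splits in real output.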

answered Oct 10 '22 04:10

erjiang

Tags: javascript, google-chrome, cjk
