
How is this mixed-character string split on Unicode word boundaries?

Consider the string "abc를". According to Unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, three different Rust implementations of word boundary detection (regex, unic-segment, unicode-segmentation) all disagree and group the string into a single word. Which behavior is correct?

As a follow-up, if the grouped behavior is correct, what would be a good way to scan this string for the search term "abc" in a way that still mostly respects word boundaries (for the purpose of checking the validity of string translations)? I'd want to match something like "abc를" but not something like "abcdef".

Lucretiel asked Feb 06 '21 20:02



1 Answer

I'm not so certain that the demo for word segmentation should be taken as the ground truth, even if it is on an official site. For example, it considers "abc를" ("abc\uB97C") to be two separate words but considers "abc를" ("abc\u1105\u1173\u11af") to be one, even though the former decomposes to the latter.
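The decomposition claim can be checked by hand, since precomposed Hangul syllables decompose arithmetically. The following is a minimal sketch using only the standard library; the function name `decompose_hangul` is made up for illustration, and the constants are the ones given for Hangul syllable decomposition in the Unicode Standard. In real code a normalization crate would be the more usual choice.

```rust
/// Hypothetical helper: decompose a precomposed Hangul syllable
/// (U+AC00..=U+D7A3) into its conjoining jamo.
/// Returns None for any other character.
fn decompose_hangul(c: char) -> Option<Vec<char>> {
    const S_BASE: u32 = 0xAC00; // first precomposed syllable
    const L_BASE: u32 = 0x1100; // first leading consonant
    const V_BASE: u32 = 0x1161; // first vowel
    const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
    const V_COUNT: u32 = 21;
    const T_COUNT: u32 = 28;
    const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
    const S_COUNT: u32 = 19 * N_COUNT; // 11172

    let s_index = (c as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }

    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;

    let mut jamo = vec![char::from_u32(l)?, char::from_u32(v)?];
    // t == T_BASE means the syllable has no trailing consonant.
    if t != T_BASE {
        jamo.push(char::from_u32(t)?);
    }
    Some(jamo)
}

fn main() {
    let jamo = decompose_hangul('\u{B97C}').expect("U+B97C is a Hangul syllable");
    let escaped: Vec<String> = jamo
        .iter()
        .map(|c| format!("U+{:04X}", *c as u32))
        .collect();
    // prints "U+1105 U+1173 U+11AF"
    println!("{}", escaped.join(" "));
}
```

So the two spellings are canonically equivalent, which makes it surprising that the demo segments them differently.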

The idea of a word boundary isn't exactly set in stone. Unicode has a Word Boundary specification (Unicode Standard Annex #29) which outlines where word breaks should and should not occur. However, it has an extensive notes section elaborating on other cases (emphasis mine):

It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

...

My understanding is that the crates you list are following the spec without further contextual analysis. Why the demo disagrees I cannot say, but it may be an attempt to implement one of these edge cases.


To address your specific problem, I'd suggest using the regex crate with \b to match a word boundary. By default this is Unicode-aware: "를" counts as a word character, so \b sees no boundary between "c" and "를". However, the crate offers an escape hatch to fall back to ASCII behaviour. Simply use (?-u:\b) to match a non-Unicode boundary:

use regex::Regex;

fn main() {
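    // (?-u:\b) is an ASCII-only word boundary, so "를" counts as a non-word character.
    let pattern = Regex::new(r"(?-u:\b)abc(?-u:\b)").unwrap();
    // Skips "abcdef" and matches the "abc" in "abc를".
    println!("{:?}", pattern.find("some abcdef abc를 sentence"));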
}

You can run it for yourself on the playground to test your cases and see if this works for you.
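To make the ASCII boundary rule concrete, here is a rough standard-library sketch of the same check without the regex crate. The names `is_ascii_word_byte` and `find_ascii_word` are hypothetical, and the sketch assumes the search term itself begins and ends with ASCII word characters (letters, digits or underscore); it is an approximation of what (?-u:\b) does for such terms, not a replacement for the crate.

```rust
/// True for the bytes an ASCII-only `\b` treats as word characters.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical helper: byte offset of the first occurrence of `term`
/// in `text` that has no ASCII word character directly before or after it.
/// Assumes `term` starts and ends with ASCII word characters.
fn find_ascii_word(text: &str, term: &str) -> Option<usize> {
    if term.is_empty() {
        return None;
    }
    let bytes = text.as_bytes();
    for (start, _) in text.match_indices(term) {
        let end = start + term.len();
        // Every byte of a multi-byte UTF-8 character is >= 0x80, so
        // inspecting single neighbouring bytes is safe: "를" is never
        // mistaken for an ASCII word character.
        let before_ok = start == 0 || !is_ascii_word_byte(bytes[start - 1]);
        let after_ok = end == bytes.len() || !is_ascii_word_byte(bytes[end]);
        if before_ok && after_ok {
            return Some(start);
        }
    }
    None
}

fn main() {
    // prints "Some(12)": "abcdef" is skipped, the "abc" in "abc를" matches
    println!("{:?}", find_ascii_word("some abcdef abc를 sentence", "abc"));
}
```

The trade-off is the same as with (?-u:\b): any non-ASCII letter next to the term counts as a boundary, so "abcé" would match as well. For validating translations that may be acceptable, but it is worth testing against your own data.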

kmdreko answered Sep 29 '22 09:09
