How can I split a Thai sentence into words? In English, we can split words on spaces.
Example: I go to school
split = ['I', 'go', 'to', 'school']
This works by looking only at the spaces.
But Thai has no spaces between words, so I don't know how to do this. Example: read ฉันจะไปโรงเรียน from a txt file, split it into ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน'], and write the output to another txt file.
Are there any programs or libraries that identify Thai word boundaries and split?
In 2006, someone contributed code to the Apache Lucene project to make this work.
Their approach (written in Java) was to use the BreakIterator class, calling getWordInstance() with a Thai locale to get a dictionary-based word iterator for the Thai language. Note also that there is a stated dependency on the ICU4J project. I have pasted the relevant section of their code below:
private BreakIterator breaker = null;
private Token thaiToken = null;

public ThaiWordFilter(TokenStream input) {
    super(input);
    breaker = BreakIterator.getWordInstance(new Locale("th"));
}

public Token next() throws IOException {
    if (thaiToken != null) {
        String text = thaiToken.termText();
        int start = breaker.current();
        int end = breaker.next();
        if (end != BreakIterator.DONE) {
            return new Token(text.substring(start, end),
                thaiToken.startOffset()+start,
                thaiToken.startOffset()+end, thaiToken.type());
        }
        thaiToken = null;
    }
    Token tk = input.next();
    if (tk == null) {
        return null;
    }
    String text = tk.termText();
    if (UnicodeBlock.of(text.charAt(0)) != UnicodeBlock.THAI) {
        return new Token(text.toLowerCase(),
            tk.startOffset(),
            tk.endOffset(),
            tk.type());
    }
    thaiToken = tk;
    breaker.setText(text);
    int end = breaker.next();
    if (end != BreakIterator.DONE) {
        return new Token(text.substring(0, end),
            thaiToken.startOffset(),
            thaiToken.startOffset()+end,
            thaiToken.type());
    }
    return null;
}
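If you only need the word splitting and not the Lucene filter around it, the same BreakIterator idea can be tried on its own. The following is a minimal sketch, assuming the JDK's java.text.BreakIterator rather than ICU4J; the class name ThaiSplit and the method splitWords are made up for this illustration. How the Thai sentence is divided depends on the locale data and dictionary shipped with your Java runtime, so the result may not match the hand-made split in the question.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSplit {

    // Hypothetical helper: returns the word-like segments of `text`,
    // using the word BreakIterator for the given locale.
    static List<String> splitWords(String text, Locale locale) {
        BreakIterator breaker = BreakIterator.getWordInstance(locale);
        breaker.setText(text);

        List<String> words = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next();
             end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            String piece = text.substring(start, end);
            // Boundaries also surround spaces and punctuation;
            // keep only segments that contain a letter or digit.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // English: the boundaries fall around the spaces.
        System.out.println(splitWords("I go to school", Locale.ENGLISH));

        // Thai: boundaries come from the runtime's Thai break rules,
        // if the runtime includes them.
        System.out.println(splitWords("ฉันจะไปโรงเรียน", Locale.forLanguageTag("th")));
    }
}
```

To go from one txt file to another, as in the question, you could read each line with java.nio.file.Files.readAllLines, pass it to splitWords, and write the joined result with Files.write.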