
How to split a Thai sentence, which does not use spaces, into words?

How can I split a Thai sentence into words? In English, words can be split on spaces.

Example: "I go to school" splits into ['I', 'go', 'to', 'school'] by looking only at the spaces.

But Thai does not put spaces between words, so I don't know how to do this. For example, I want to read ฉันจะไปโรงเรียน from a text file, split it into ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน'], and write the result to another text file.

Are there any programs or libraries that identify Thai word boundaries and split the text accordingly?

asked Dec 11 '12 18:12 by Impossible is Nothing


1 Answer

In 2006, someone contributed code to the Apache Lucene project to make this work.

Their approach (written in Java) uses the BreakIterator class, calling getWordInstance() with a Thai locale to get a dictionary-based word iterator for the Thai language. Note that the contribution also states a dependency on the ICU4J project. The relevant section of their code is pasted below:

  // Word-boundary iterator for the Thai locale.
  private BreakIterator breaker = null;
  // The Thai token currently being split, or null if none is in progress.
  private Token thaiToken = null;

  public ThaiWordFilter(TokenStream input) {
    super(input);
    breaker = BreakIterator.getWordInstance(new Locale("th"));
  }

  public Token next() throws IOException {
    // Still inside a Thai token: emit its next word, if any remain.
    if (thaiToken != null) {
      String text = thaiToken.termText();
      int start = breaker.current();
      int end = breaker.next();
      if (end != BreakIterator.DONE) {
        return new Token(text.substring(start, end), 
            thaiToken.startOffset()+start,
            thaiToken.startOffset()+end, thaiToken.type());
      }
      thaiToken = null;
    }
    // Fetch the next token from the upstream tokenizer.
    Token tk = input.next();
    if (tk == null) {
      return null;
    }
    String text = tk.termText();
    // Non-Thai tokens are lowercased and passed through unsplit.
    if (UnicodeBlock.of(text.charAt(0)) != UnicodeBlock.THAI) {
      return new Token(text.toLowerCase(), 
                       tk.startOffset(), 
                       tk.endOffset(), 
                       tk.type());
    }
    // Thai token: start splitting it and return the first word.
    thaiToken = tk;
    breaker.setText(text);
    int end = breaker.next();
    if (end != BreakIterator.DONE) {
      return new Token(text.substring(0, end), 
          thaiToken.startOffset(), 
          thaiToken.startOffset()+end,
          thaiToken.type());
    }
    return null;
  }
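
The excerpt above depends on Lucene's old Token API, so it will not compile on its own. As a rough standalone sketch of the same idea, the following uses only java.text.BreakIterator from the JDK. The class name ThaiSegmenter and the method segment are hypothetical names chosen for illustration, and the file paths are taken from the command line. The exact word boundaries depend on the Thai dictionary data shipped with your Java runtime, so a compound such as โรงเรียน may come back as one word rather than as โรง and เรียน.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of Lucene or the JDK.
class ThaiSegmenter {

    // Splits text at the word boundaries reported by the JDK's
    // BreakIterator for the Thai locale, dropping whitespace-only pieces.
    static List<String> segment(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        breaker.setText(text);
        List<String> words = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }

    // Reads a UTF-8 text file, segments it, and writes one word per line.
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ThaiSegmenter <input.txt> <output.txt>");
            return;
        }
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String joined = String.join("\n", segment(text));
        Files.write(Paths.get(args[1]), joined.getBytes(StandardCharsets.UTF_8));
    }
}
```

If the runtime's segmentation is not good enough for your text, ICU4J offers a BreakIterator class with the same getWordInstance() entry point, so it may be possible to swap it in with few changes.
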
answered Oct 14 '22 21:10 by mpontillo