
How to tokenize a Chinese-language document

Tags:

java

tokenize

I will be receiving documents written in Chinese, which I have to tokenize and store in a database table. I tried Lucene's CJKBigramFilter, but all it does is join every two adjacent characters together, and the resulting pairs often mean something different from what is in the document.

Suppose this is a line in the file: "Hello My name is Pradeep", which in Traditional Chinese is "你好我的名字是普拉迪普". When I tokenize it, it gets converted to the two-character tokens below:

你好 - Hello
名字 - Name
好我 - Well I
字是 - Word is
我的 - My
拉迪 - Radi
是普 - Is the S & P
普拉 - Pula
的名 - In the name of
迪普 - Dipp

What I want is tokens whose meaning matches the English translation. I am using Lucene for this, but if you know of any other suitable open-source tool, please point me to it. Thanks in advance.
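The behaviour described above can be reproduced without Lucene. The following is a minimal sketch of overlapping bigram generation, not Lucene's actual CJKBigramFilter code; the class and method names (`BigramSketch`, `bigrams`) are made up for illustration, and it assumes every character fits in a single Java `char`.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Emit every pair of adjacent characters (overlapping bigrams).
    // Assumption: all characters are in the Basic Multilingual Plane,
    // so one Java char corresponds to one Chinese character.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // The sentence from the question.
        System.out.println(bigrams("你好我的名字是普拉迪普"));
    }
}
```

Because every character starts a new pair, an 11-character sentence yields 10 tokens, and pairs such as 好我 or 是普 straddle word boundaries. That is why some of the tokens do not correspond to real words.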

Pradeep asked Feb 19 '23 03:02



1 Answer

Though it may be too late, you might try U-Tokenizer, an online API that is available for free. See http://tokenizer.tool.uniwits.com/
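For comparison with the bigram output in the question, here is a rough sketch of dictionary-based segmentation using forward maximum matching. It is not how U-Tokenizer works (its internals are not described here), and the dictionary, the `MaxMatchSketch` class, and the `segment` method are all hypothetical; a real word segmenter would use a much larger lexicon and usually a statistical model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {
    // Forward maximum matching: at each position take the longest
    // dictionary word (up to maxLen characters); if none matches,
    // fall back to a single character.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a known word or one character long.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed purely for this example.
        Set<String> dict = Set.of("你好", "我的", "名字", "是", "普拉迪普");
        System.out.println(segment("你好我的名字是普拉迪普", dict, 4));
    }
}
```

With this toy dictionary the sentence splits on word boundaries instead of every character pair. The quality of the result depends entirely on the dictionary, so names such as 普拉迪普 are only kept whole if they are listed.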

Afante answered Mar 04 '23 02:03
