I will be receiving documents written in Chinese that I have to tokenize and store in a database table. I tried Lucene's CJKBigramFilter, but all it does is join every two adjacent characters together, which changes the meaning from what is in the document. Suppose the file contains the line "Hello My name is Pradeep", which in Traditional Chinese is "你好我的名字是普拉迪普". When I tokenize it, it gets converted into the two-character tokens below:

你好 - Hello
名字 - Name
好我 - Well I
字是 - Word is
我的 - My
拉迪 - Radi
是普 - Is the S & P
普拉 - Pula
的名 - In the name of
迪普 - Dipp

All I want is for the tokens to map to the same English translation. I am using Lucene for this; if you know of another suitable open-source library, please direct me to it. Thanks in advance.
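For context, the output above is CJKBigramFilter working as designed: it indexes every overlapping pair of adjacent CJK characters rather than dictionary words, which is why tokens like 好我 straddle word boundaries. A minimal plain-Python sketch of that overlapping-bigram logic (an illustration only, not Lucene code):

```python
def cjk_bigrams(text):
    """Return every overlapping pair of adjacent characters,
    mimicking what a CJK bigram filter emits for a run of CJK text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = cjk_bigrams("你好我的名字是普拉迪普")
print(tokens)
# Produces the same ten bigrams listed in the question:
# 你好, 好我, 我的, 的名, 名字, 字是, 是普, 普拉, 拉迪, 迪普
```

To get word-level tokens (你好, 名字, ...) instead of bigrams, a dictionary-based segmenter is needed; Lucene ships one (SmartChineseAnalyzer, for Simplified Chinese), though whether it fits your documents is something to verify.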
Though it may be too late, you might try U-Tokenizer, an online API that is available for free. See http://tokenizer.tool.uniwits.com/