I'm using the Stanford POS Tagger for the first time. It tags English correctly, but it does not seem to recognize (Simplified) Chinese, even when I change the model parameter. Have I overlooked something?
I've downloaded and unpacked the latest full version from here: http://nlp.stanford.edu/software/tagger.shtml
Then I put the following sample text into "sample-input.txt":
这是一个测试的句子。这是另一个句子。
Then I simply ran:
./stanford-postagger.sh models/chinese-distsim.tagger sample-input.txt
I expected each word to be tagged with a part of speech, but instead the tagger treats the entire string of text as one word:
Loading default properties from tagger models/chinese-distsim.tagger
Reading POS tagger model from models/chinese-distsim.tagger ... done [3.5 sec].
這是一個測試的句子。這是另一個句子。#NR
Tagged 1 words at 30.30 words per second.
I appreciate any help.
I finally realized that tokenization/segmentation is not included in this POS tagger. It appears the words must be space-delimited before they are fed to the tagger. For those interested in maximum entropy word segmentation of Chinese, there is a separate package available here:
http://nlp.stanford.edu/software/segmenter.shtml
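To illustrate why the tagger reported "Tagged 1 words", here is a minimal Python sketch. It does not call the Stanford tools at all; it only mimics whitespace-based word splitting, on the assumption (taken from the observation above) that the tagger treats each space-delimited chunk as one word. The helper name whitespace_tokens is hypothetical, and the segmented sentence was split by hand for illustration, so a real segmenter may choose different word boundaries.

```python
# Illustration only: mimics whitespace-based word splitting.
# This is NOT the Stanford tagger's code; it just shows why unsegmented
# Chinese text looks like a single "word" to a whitespace-based reader.

def whitespace_tokens(text: str) -> list[str]:
    """Split text on runs of whitespace (hypothetical helper)."""
    return text.split()

# The original input: no spaces between words.
raw = "这是一个测试的句子。这是另一个句子。"

# Hand-segmented for illustration (an assumption, not segmenter output).
segmented = "这 是 一个 测试 的 句子 。 这 是 另 一个 句子 。"

raw_tokens = whitespace_tokens(raw)
seg_tokens = whitespace_tokens(segmented)

print(len(raw_tokens))  # the whole string counts as one token
print(len(seg_tokens))  # one token per space-delimited word
```

Writing text in the second form to sample-input.txt (for example, by running it through the segmenter first) should give the tagger separate words to tag.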
Thanks everyone.