
Stanford POS Tagger not tagging Chinese text

I'm using the Stanford POS Tagger for the first time. It tags English correctly, but it does not seem to recognize (Simplified) Chinese, even when I change the model parameter. Have I overlooked something?

I've downloaded and unpacked the latest full version from here: http://nlp.stanford.edu/software/tagger.shtml

Then I entered the following sample text into "sample-input.txt":

这是一个测试的句子。这是另一个句子。

Then I simply ran:

./stanford-postagger.sh models/chinese-distsim.tagger sample-input.txt

I expected each word to be tagged with its part of speech, but instead the tagger treats the entire string of text as one word:

Loading default properties from tagger models/chinese-distsim.tagger

Reading POS tagger model from models/chinese-distsim.tagger ... done [3.5 sec].

這是一個測試的句子。這是另一個句子。#NR

Tagged 1 words at 30.30 words per second.

I appreciate any help.

Ryan Rapp asked Apr 18 '13 04:04


1 Answer

I finally realized that this POS tagger does not include tokenization/segmentation. It appears the words must be space-delimited before they are fed to the tagger, which is why the whole unsegmented string was tagged as a single word. For those who need Chinese word segmentation, there is a separate package available here:

http://nlp.stanford.edu/software/segmenter.shtml
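To illustrate the difference, here is a minimal Python sketch. It assumes the tagger treats whitespace as the only word boundary for Chinese input, which is consistent with the "Tagged 1 words" output above but is not confirmed from the tagger's source. The word list is segmented by hand for illustration and is not output from the Stanford segmenter; the function names are hypothetical.

```python
from typing import List

# The raw sample text from the question: no spaces between words.
RAW_TEXT = "这是一个测试的句子。这是另一个句子。"

# Hand-made segmentation (an assumption for illustration only,
# NOT produced by the Stanford segmenter).
SEGMENTED_WORDS: List[str] = [
    "这", "是", "一个", "测试", "的", "句子", "。",
    "这", "是", "另", "一个", "句子", "。",
]


def count_whitespace_tokens(text: str) -> int:
    """Count tokens as a purely whitespace-delimited reader would see them."""
    return len(text.split())


def to_space_delimited(words: List[str]) -> str:
    """Join pre-segmented words with single spaces."""
    return " ".join(words)


segmented_text = to_space_delimited(SEGMENTED_WORDS)

# The unsegmented string is a single whitespace token ...
print(count_whitespace_tokens(RAW_TEXT))
# ... while the space-delimited version exposes every word separately.
print(count_whitespace_tokens(segmented_text))
print(segmented_text)
```

Saving `segmented_text` as UTF-8 in "sample-input.txt" and rerunning the same command from the question should then give the tagger individual words to tag. In practice the segmentation would come from the segmenter package rather than from a hand-written list.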

Thanks everyone.

Ryan Rapp answered Oct 12 '22 23:10