When tokenizing text that contains both Chinese and English, the segmenter splits English words into individual letters, which is not what I want. Consider the following code:
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# use the Chinese model bundled with the Stanford Segmenter
segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(segmenter.segment('哈佛大学的Melissa Dell'))
The output is 哈佛大学 的 M e l i s s a D e l l. How do I modify this behavior?
You could try jieba; it keeps English words intact as whole tokens:
import jieba
jieba.lcut('哈佛大学的Melissa Dell')
['哈佛大学', '的', 'Melissa', ' ', 'Dell']
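Note that jieba keeps the space between 'Melissa' and 'Dell' as its own token. If you only want the words, a minimal sketch of filtering out whitespace-only tokens (assuming that is the output you are after):

import jieba

text = '哈佛大学的Melissa Dell'
# drop tokens that consist only of whitespace, keeping Chinese words and whole English words
tokens = [tok for tok in jieba.lcut(text) if tok.strip()]
print(tokens)  # ['哈佛大学', '的', 'Melissa', 'Dell']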