
Tokenizing texts in both Chinese and English improperly splits English words into letters

When I tokenize text that contains both Chinese and English with the Stanford segmenter, the English words get split into individual letters, which is not what I want. Consider the following code:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Requires the Stanford Word Segmenter to be available to NLTK.
segmenter = StanfordSegmenter()
segmenter.default_config('zh')  # load the default Chinese segmentation model
print(segmenter.segment('哈佛大学的Melissa Dell'))

The output will be 哈佛大学 的 M e l i s s a D e l l. How do I modify this behavior?
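One workaround sketch (assuming the same StanfordSegmenter setup as above, and that only the Chinese spans need the segmenter) is to split the input into Chinese and non-Chinese runs first, segment only the Chinese runs, and whitespace-split the rest. The mixed_tokenize helper below is hypothetical, not part of NLTK:

import re
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

def mixed_tokenize(text, segmenter):
    """Segment CJK runs with the Stanford segmenter; whitespace-split the rest."""
    tokens = []
    # Alternate runs of CJK characters and everything else.
    for run in re.findall(r'[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+', text):
        if re.match(r'[\u4e00-\u9fff]', run):
            # segment() returns a space-separated string of Chinese tokens.
            tokens.extend(segmenter.segment(run).split())
        else:
            # English (and any punctuation or digits) is simply split on whitespace.
            tokens.extend(run.split())
    return tokens

segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(mixed_tokenize('哈佛大学的Melissa Dell', segmenter))
# expected: ['哈佛大学', '的', 'Melissa', 'Dell']

Note that segment() starts the Java segmenter on every call, so this sketch is slow when the text contains many separate Chinese runs.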

1 Answer

You could try jieba.

>>> import jieba
>>> jieba.lcut('哈佛大学的Melissa Dell')
['哈佛大学', '的', 'Melissa', ' ', 'Dell']
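jieba keeps the space between the Latin-script words as its own token. If that is unwanted, a small filter (a sketch, not something jieba does by itself) drops whitespace-only tokens:

import jieba

text = '哈佛大学的Melissa Dell'
# Keep only tokens that contain non-whitespace characters.
tokens = [tok for tok in jieba.lcut(text) if tok.strip()]
print(tokens)
# ['哈佛大学', '的', 'Melissa', 'Dell']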