
Tokenizing texts in both Chinese and English improperly splits English words into letters

When I tokenize text that contains both Chinese and English with the Stanford segmenter, the English words get split into individual letters, which is not what I want. Consider the following code:

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Requires the Stanford Word Segmenter to be available to NLTK.
segmenter = StanfordSegmenter()
segmenter.default_config('zh')  # load the default Chinese segmentation model
print(segmenter.segment('哈佛大学的Melissa Dell'))

The output will be 哈佛大学 的 M e l i s s a D e l l. How do I modify this behavior?
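One workaround sketch (assuming the same StanfordSegmenter setup as above, and that only the Chinese spans need the segmenter) is to split the input into Chinese and non-Chinese runs first, segment only the Chinese runs, and whitespace-split the rest. The mixed_tokenize helper below is hypothetical, not part of NLTK:

import re
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

def mixed_tokenize(text, segmenter):
    """Segment CJK runs with the Stanford segmenter; whitespace-split the rest."""
    tokens = []
    # Alternate runs of CJK characters and everything else.
    for run in re.findall(r'[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+', text):
        if re.match(r'[\u4e00-\u9fff]', run):
            # segment() returns a space-separated string of Chinese tokens.
            tokens.extend(segmenter.segment(run).split())
        else:
            # English (and any punctuation or digits) is simply split on whitespace.
            tokens.extend(run.split())
    return tokens

segmenter = StanfordSegmenter()
segmenter.default_config('zh')
print(mixed_tokenize('哈佛大学的Melissa Dell', segmenter))
# expected: ['哈佛大学', '的', 'Melissa', 'Dell']

Note that segment() starts the Java segmenter on every call, so this sketch is slow when the text contains many separate Chinese runs.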

1 Answer

You could try jieba.

>>> import jieba
>>> jieba.lcut('哈佛大学的Melissa Dell')
['哈佛大学', '的', 'Melissa', ' ', 'Dell']
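jieba keeps the space between the Latin-script words as its own token. If that is unwanted, a small filter (a sketch, not something jieba does by itself) drops whitespace-only tokens:

import jieba

text = '哈佛大学的Melissa Dell'
# Keep only tokens that contain non-whitespace characters.
tokens = [tok for tok in jieba.lcut(text) if tok.strip()]
print(tokens)
# ['哈佛大学', '的', 'Melissa', 'Dell']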