Splitting chinese document into sentences [closed]

Question

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese.

Please can you let me know any good sentence splitters for Chinese preferably in Java or Python.

alvas · Accepted Answer

Using some regex tricks in Python (c.f. a modified regex of Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):

import re

paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

def zng(paragraph):
    for sent in re.findall(u'[^!?。\.\!\?]+[!?。\.\!\?]?', paragraph, flags=re.U):
        yield sent

list(zng(paragraph))

Regex explanation: https://regex101.com/r/eNFdqM/2

langkilde · Answer

Either of these open sources projects should be useful afaik:

HanLP https://github.com/hankcs/HanLP
FudanNLP https://github.com/FudanNLP/fnlp

Splitting chinese document into sentences [closed]

Tags:

tokenize

nlp

stanford-nlp

sentence

pjesudhas

2 Answers

alvas

langkilde

Recent Activity

Donate For Us

Splitting chinese document into sentences [closed]

Tags:

tokenize

nlp

stanford-nlp

sentence

pjesudhas

2 Answers

alvas

langkilde

Related questions

Recent Activity

Donate For Us