Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting chinese document into sentences [closed]

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese.

Please can you let me know any good sentence splitters for Chinese preferably in Java or Python.

like image 263
pjesudhas Avatar asked Dec 12 '14 10:12

pjesudhas


2 Answers

Using some regex tricks in Python (c.f. a modified regex of Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf):

import re

paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

def zng(paragraph):
    for sent in re.findall(u'[^!?。\.\!\?]+[!?。\.\!\?]?', paragraph, flags=re.U):
        yield sent

list(zng(paragraph))

Regex explanation: https://regex101.com/r/eNFdqM/2


like image 176
alvas Avatar answered Jan 03 '23 14:01

alvas


Either of these open sources projects should be useful afaik:

  • HanLP https://github.com/hankcs/HanLP
  • FudanNLP https://github.com/FudanNLP/fnlp
like image 33
langkilde Avatar answered Jan 03 '23 13:01

langkilde