
Python: Split unicode string on word boundaries

I need to take a string, and shorten it to 140 characters.

Currently I am doing:

import re

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)  # characters left for the text itself
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break  # no room left for this word
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

This works well for English and English-like strings, but fails for a Chinese string because tweet.split() returns a list with a single element:

>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> s
>>> s.split()

How should I do this so it handles I18N? Does this make sense in all languages?

I'm on Python 2.5.4, if that matters.

Paul Tarjan asked Nov 15 '09 20:11

2 Answers

Chinese doesn't usually have whitespace between words, and the symbols can have different meanings depending on context. You will have to understand the text in order to split it at a word boundary. In other words, what you are trying to do is not easy in general.
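
As a rough illustration of one way to cope without real word segmentation, the sketch below cuts at whitespace when the text has any and otherwise falls back to a hard cut at a character (code point) boundary. It assumes Python 3, where str is Unicode, rather than the 2.5.4 mentioned in the question. The function name shorten and its parameters are hypothetical, and the hard cut may still split a Chinese word in the middle.

```python
import re

def shorten(text, footer="", limit=140):
    """Trim text so that text plus footer fits within limit characters.

    Prefers to cut at whitespace. When the leading part has no whitespace
    (as with most Chinese text), it falls back to a hard cut at a code
    point boundary, which may land in the middle of a word.
    """
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    if len(text) <= limit:
        return text  # short enough: returned unchanged, without the footer
    avail = limit - len(footer)
    if avail <= 0:
        raise ValueError("footer leaves no room for any text")
    head = text[:avail]
    # If the cut landed inside a word, back up to the last space, if any.
    if text[avail] != " " and " " in head:
        head = head[:head.rindex(" ")]
    return head.rstrip() + footer
```

For example, shorten("aaa bbbb ccc", footer="...", limit=9) backs up to the space and gives "aaa...", while a 200-character string with no spaces is simply cut to 140 characters including the footer.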

Mark Byers answered Sep 22 '22 01:09

For word segmentation in Chinese, and other advanced natural-language processing tasks, consider NLTK as a good starting point, if not a complete solution. It's a rich Python-based toolkit, particularly good for learning about NL processing techniques (and often good enough to offer a viable solution to some of these problems).
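
To give a feel for what dictionary-based segmentation involves, here is a minimal sketch of greedy forward maximum matching, a classic baseline technique. It does not use NLTK and is not NLTK's API. The function name segment, the max_len parameter, and the tiny LEXICON are all hypothetical, and a real segmenter needs a large lexicon or a statistical model. Python 3 is assumed.

```python
def segment(text, lexicon, max_len=4):
    """Split text into words by greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in lexicon; if nothing matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy lexicon, for illustration only.
LEXICON = {"新華社", "報道", "美國", "總統", "奧巴馬"}

print(segment("新華社報道美國總統奧巴馬", LEXICON))
```

Once the text is a list of words, the shortening loop from the question can accumulate them until the limit is reached, joining without spaces for Chinese.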

Alex Martelli answered Sep 20 '22 01:09