
Split a sentence into separate words

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).

At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

  1. try to find the first two characters of the sentence in the database (主楼),

  2. if 主楼 is a word and is in the database, try to find the first three characters (主楼怎). 主楼怎 is not a word, so it is not in the database, which tells my application that 主楼 is a separate word,

  3. repeat the same process with the remaining characters.
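The steps above can be sketched roughly as follows. This is only an illustration, not the asker's actual script: it assumes the dictionary has been loaded once into an in-memory set (the hypothetical `WORDS` below) instead of being queried per lookup, and it falls back to a single character when no longer match exists. The function name `segment_incremental` is made up for the example.

```python
def segment_incremental(sentence, words):
    """Greedy segmentation that grows each candidate one character at a time."""
    result = []
    i = 0
    while i < len(sentence):
        end = i + 1  # fallback: treat a single character as a word
        # Keep extending while the longer candidate is still a dictionary word.
        while end < len(sentence) and sentence[i:end + 1] in words:
            end += 1
        result.append(sentence[i:end])
        i = end
    return result


# Toy dictionary (assumption); a real one would be loaded from the database once.
WORDS = {"主楼", "怎么"}

print(segment_incremental("主楼怎么走", WORDS))  # → ['主楼', '怎么', '走']
```

Note that this stops extending as soon as one candidate is missing, so a three-character word whose two-character prefix is not itself in the dictionary would not be found.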

I don't really like this approach, because to analyze even a small text it would query the database too many times.

Are there any other solutions to this?

Peterim asked Jan 26 '10 15:01



2 Answers

Thanks to everyone for your help!

After a little research, and with all your suggestions in mind, I've found some working tools, which is why I'm answering my own question.

  1. A PHP class (http://www.phpclasses.org/browse/package/2431.html)

  2. A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)

  3. A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)

  4. There are some other solutions available if you search baidu.com for "中文分词" (Chinese word segmentation)

Sincerely,

Equ

Peterim answered Oct 14 '22 04:10


You might want to consider using a trie data structure. You first construct the trie from the dictionary; searching for valid words is then much faster. The advantage is that, at each character, you can tell very quickly whether you have reached the end of a word or need to keep looking for a longer one.
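A minimal sketch of the trie idea, assuming the dictionary fits in memory and that taking the longest match at each position is acceptable. The names `build_trie`, `longest_match`, `segment` and the `END` marker are hypothetical, and the nested-dict layout is just one simple way to represent a trie.

```python
END = ""  # marker key for "a word ends here"; never collides with a 1-char key


def build_trie(words):
    """Build a nested-dict trie from an iterable of dictionary words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, sentence, start):
    """Return the end index of the longest dictionary word starting at `start`."""
    node = trie
    best = start + 1  # fallback: a single character
    for pos in range(start, len(sentence)):
        node = node.get(sentence[pos])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            best = pos + 1  # a complete word ends here; keep looking for longer
    return best


def segment(sentence, trie):
    """Split a sentence by repeatedly taking the longest match."""
    result = []
    i = 0
    while i < len(sentence):
        end = longest_match(trie, sentence, i)
        result.append(sentence[i:end])
        i = end
    return result


trie = build_trie(["主楼", "怎么", "走"])
print(segment("主楼怎么走", trie))  # → ['主楼', '怎么', '走']
```

Each character of the sentence is examined with a single dictionary lookup in memory, so no database query is needed per candidate. Greedy longest matching can still pick the wrong split for ambiguous sentences.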

Vincent Ramdhanie answered Oct 14 '22 06:10