I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like 主楼怎么走 (with spaces it would be: 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary of Chinese words (in a database). The script will:
- try to find the first two characters of the sentence in the database (主楼);
- if 主楼 is actually a word and it's in the database, try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database, and my application now knows that 主楼 is a separate word;
- repeat with the rest of the characters.
I don't really like this approach, because analyzing even a small text would mean querying the database far too many times.
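For reference, here is a minimal sketch of this greedy dictionary lookup in Python, with the dictionary loaded once into an in-memory set so each lookup avoids a database round trip. It uses the forward maximum-matching variant (start from the longest candidate and shrink), which is close to, but not exactly, the loop described above; the dictionary contents and the word-length cap are purely illustrative.

```python
# Hypothetical in-memory dictionary; in practice this would be
# loaded once from the database rather than queried per lookup.
DICTIONARY = {"主楼", "怎么", "走"}
MAX_WORD_LEN = 8  # assumed upper bound on word length

def segment(sentence):
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a
        # dictionary word (or a single character) matches.
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment("主楼怎么走"))  # ['主楼', '怎么', '走']
```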
Are there any other solutions to this?
Method 1: Split a sentence into a list using split(). The simplest approach Python provides for turning a sentence into a list of words is the split() method, which splits a string into a list where each word is a list item. It takes an optional separator (if omitted, it splits on any run of whitespace) and an optional maximum number of splits (maxsplit). It returns a new list and does not change the original string.
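A quick illustration; note that split() presupposes that delimiters are already present, so for the Chinese example above it only applies after segmentation has inserted the spaces:

```python
# Default split: any run of whitespace is a separator.
sentence = "主楼 怎么 走"
print(sentence.split())    # ['主楼', '怎么', '走']

# Explicit separator plus a maximum number of splits.
line = "a,b,c,d"
print(line.split(",", 2))  # ['a', 'b', 'c,d']
```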
Thanks to everyone for your help!
After a little research I've found some working tools (keeping all your suggestions in mind), which is why I'm answering my own question:
A PHP class (http://www.phpclasses.org/browse/package/2431.html)
A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)
A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)
There are some other solutions available if you search baidu.com for "中文分词" ("Chinese word segmentation").
Sincerely,
Equ
You might want to consider using a trie data structure. You first construct the trie from the dictionary; then searching for valid words will be much faster. The advantage is that it is very fast to determine whether you are at the end of a word or need to continue looking for a longer one.
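A minimal sketch of that idea, assuming a plain nested-dict trie and a greedy longest-match walk (the end-of-word marker key and the fallback to single characters are illustrative choices here, not part of any particular library):

```python
# A minimal trie: nested dicts, with "$" marking the end of a word.
def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return trie

def segment(sentence, trie):
    words, i = [], 0
    while i < len(sentence):
        node, longest = trie, 0
        # Walk the trie character by character, remembering the last
        # position where a complete dictionary word ended.
        for j in range(i, len(sentence)):
            ch = sentence[j]
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                longest = j - i + 1
        # Fall back to a single character if nothing matched.
        length = longest or 1
        words.append(sentence[i:i + length])
        i += length
    return words

trie = build_trie(["主楼", "怎么", "走"])
print(segment("主楼怎么走", trie))  # ['主楼', '怎么', '走']
```

A single trie walk replaces the repeated prefix lookups of the database approach: each character either extends the current path or ends the search immediately.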