I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like 主楼怎么走 (with spaces it would be: 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary of Chinese words (in a database). The script will:
- try to find the first two characters of the sentence in the database (主楼);
- if 主楼 is actually a word and it's in the database, try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database, and my application now knows that 主楼 is a separate word;
- repeat with the rest of the characters.
I don't really like this approach, because analyzing even a small text would mean querying the database far too many times.
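For reference, here is a minimal sketch of this greedy dictionary lookup in Python, with the dictionary loaded once into an in-memory set so each lookup avoids a database round trip. It uses the forward maximum-matching variant (start from the longest candidate and shrink), which is close to, but not exactly, the loop described above; the dictionary contents and the word-length cap are purely illustrative.

```python
# Hypothetical in-memory dictionary; in practice this would be
# loaded once from the database rather than queried per lookup.
DICTIONARY = {"主楼", "怎么", "走"}
MAX_WORD_LEN = 8  # assumed upper bound on word length

def segment(sentence):
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a
        # dictionary word (or a single character) matches.
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment("主楼怎么走"))  # ['主楼', '怎么', '走']
```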
Are there any other solutions to this?
Method 1: Split a sentence into a list using split(). The simplest approach Python provides for turning a sentence into a list of words is the split() method, which splits a string into a list where each word is a list item. It takes an optional separator (if omitted, it splits on any run of whitespace) and an optional maximum number of splits (maxsplit). It returns a new list and does not change the original string.
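A quick illustration; note that split() presupposes that delimiters are already present, so for the Chinese example above it only applies after segmentation has inserted the spaces:

```python
# Default split: any run of whitespace is a separator.
sentence = "主楼 怎么 走"
print(sentence.split())    # ['主楼', '怎么', '走']

# Explicit separator plus a maximum number of splits.
line = "a,b,c,d"
print(line.split(",", 2))  # ['a', 'b', 'c,d']
```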
Thanks to everyone for your help!
After a little research I've found some working tools (keeping all your suggestions in mind), which is why I'm answering my own question:
A PHP class (http://www.phpclasses.org/browse/package/2431.html)
A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)
A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)
There are some other solutions available if you search baidu.com for "中文分词" ("Chinese word segmentation").
Sincerely,
Equ
You might want to consider using a trie data structure. You first construct the trie from the dictionary; then searching for valid words will be much faster. The advantage is that it is very fast to determine whether you are at the end of a word or need to continue looking for a longer one.
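A minimal sketch of that idea, assuming a plain nested-dict trie and a greedy longest-match walk (the end-of-word marker key and the fallback to single characters are illustrative choices here, not part of any particular library):

```python
# A minimal trie: nested dicts, with "$" marking the end of a word.
def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return trie

def segment(sentence, trie):
    words, i = [], 0
    while i < len(sentence):
        node, longest = trie, 0
        # Walk the trie character by character, remembering the last
        # position where a complete dictionary word ended.
        for j in range(i, len(sentence)):
            ch = sentence[j]
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                longest = j - i + 1
        # Fall back to a single character if nothing matched.
        length = longest or 1
        words.append(sentence[i:i + length])
        i += length
    return words

trie = build_trie(["主楼", "怎么", "走"])
print(segment("主楼怎么走", trie))  # ['主楼', '怎么', '走']
```

A single trie walk replaces the repeated prefix lookups of the database approach: each character either extends the current path or ends the search immediately.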