How to search for Chinese characters and short words in documentation generated by Sphinx?

apt-get install python-sphinx
apt-get install sphinxsearch
mkdir rest
cd rest/
sphinx-quickstart

I created my first article in reStructuredText:
http://s.yunio.com/!LrAsu

Please download it, untar it on your computer, cd into rest/build/html, and open index.html with Chrome.

I found that the search function in the generated documentation:

1. cannot find Chinese characters
2. cannot find short words

Please see attachment 1; it is my target article to be searched. You can see "is" and 标准 in the text.

Please see attachment 2: the Chinese word 标准, which is in the text, cannot be found. Please see attachment 3: the short word "is", which is in the text, cannot be found.

How can I solve this problem?

Asked May 25 '13 by showkey


1 Answer

Edit:

Sphinx only builds an index entry for the whole Chinese sentence, since there are no spaces in it and Sphinx doesn't know where to split it into words. Check the file searchindex.js to see the indexes that were generated.

Try searching for the word '标准表达方式'; it works. ^_^
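If you want to see exactly which terms made it into the index, you can inspect searchindex.js from Python. This is only a rough sketch: it assumes the file wraps its payload in Search.setIndex(...) and that the payload parses as JSON, which may not hold for every Sphinx version, and the path below matches the build layout from the question.

import json

# searchindex.js looks roughly like: Search.setIndex({...})
with open('build/html/searchindex.js', encoding='utf-8') as f:
    raw = f.read()

# Cut out the object between the first '{' and the last '}'.
start = raw.find('{')
end = raw.rfind('}')
index = json.loads(raw[start:end + 1])

# 'terms' maps each indexed word to the documents that contain it.
for term in sorted(index.get('terms', {})):
    print(term)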

Sphinx builds indexes using a Python script, search.py. Looking into it, we can find:

stopwords = set("""
a  and  are  as  at
be  but  by
for
if  in  into  is  it
near  no  not
of  on  or
such
that  the  their  then  there  these  they  this  to
was  will  with
""".split())

That is why short words cannot be found. You can remove words from this list if you want them to appear in the index.
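If you would rather not patch the installed sources, something along these lines in conf.py may also do it. This is only a sketch: the module path is an assumption and differs between Sphinx versions (older releases may keep everything in a single sphinx/search.py), so check where the English stopword list lives in your installation.

# conf.py -- sketch only; module path is an assumption.
from sphinx.search import en

# Empty the built-in English stopword set so that short words such as
# "is" and "it" are kept in the search index.
en.SearchEnglish.stopwords = set()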

We can also find this line:

word_re = re.compile(r'\w+(?u)')

This is the regular expression Sphinx uses to split text into words. Now we can see why it cannot index individual Chinese words.
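You can reproduce the behaviour of that pattern directly; the snippet below just illustrates how the regex tokenizes mixed text (the (?u) flag in the original only turns on Unicode matching, passed explicitly here).

import re

# Same pattern as in search.py, with the Unicode flag passed explicitly.
word_re = re.compile(r'\w+', re.UNICODE)

print(word_re.findall(u'this is a 标准表达方式 example'))
# -> ['this', 'is', 'a', '标准表达方式', 'example']
# The whole Chinese run comes out as a single token, so only the full
# phrase ever reaches the index -- never the word 标准 on its own.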

The solution is to add Chinese word-splitting support to this file. Someone has already done it: http://hyry.dip.jp/tech/blog/index.html?id=374
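The general idea of such a patch is to plug in a tokenizer that knows how to segment Chinese. Here is a rough sketch using the third-party jieba library rather than the linked post's own code; the class and registration details are assumptions and depend on your Sphinx version.

# Sketch of a custom search language for the Sphinx documentation tool.
# Assumes sphinx.search exposes SearchLanguage and the languages registry;
# pip install jieba first.
import jieba
from sphinx.search import SearchLanguage, languages

class SearchChineseEnglish(SearchLanguage):
    lang = 'zh'

    def split(self, input):
        # jieba segments Chinese runs and passes Latin words through.
        return [w for w in jieba.cut(input) if w.strip()]

    def word_filter(self, word):
        # Index everything, including short words.
        return bool(word)

languages['zh'] = SearchChineseEnglish

With something like this registered, setting html_search_language = 'zh' in conf.py should pick it up. Note that recent Sphinx releases already ship a Chinese search language that can use jieba, so check your version before patching anything.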

Original answer, about the Sphinx search engine:

I leave it here in case others find it useful. Thanks to mzjn for pointing it out.

The Sphinx search engine does not support Chinese by default, since it cannot recognize the Chinese charset and doesn't know where to split words when building indexes. You need to modify the configuration file to make it index Chinese words.

More specifically, you should set charset_table, ngram_len, and ngram_chars in sphinx.conf to make it work. You can google these keywords for the proper configuration.
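For example, an index definition along these lines enables unigram indexing of CJK text. The index and source names and the path are placeholders, and the exact Unicode ranges are worth checking against the Sphinx search-engine documentation.

index my_docs
{
    source        = my_docs_source
    path          = /var/lib/sphinxsearch/data/my_docs

    # Index every CJK character as its own 1-gram.
    ngram_len     = 1
    ngram_chars   = U+4E00..U+9FFF

    # Keep ASCII letters and digits searchable as usual.
    charset_table = 0..9, A..Z->a..z, _, a..z
}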

However, Sphinx may generate a huge index this way, since every single Chinese character is treated as a word. Try coreseek instead if you really want to build an index for Chinese documents.

Answered by Naruil