apt-get install python-sphinx
apt-get install sphinxsearch
mkdir rest
cd rest/
sphinx-quickstart
I created my first article in reStructuredText:
http://s.yunio.com/!LrAsu
Please download it and untar it on your computer, cd into /rest/build/html, and open index.rst in Chrome.
I found two problems with the search function of the generated documentation:
1. It cannot search for Chinese characters.
2. It cannot search for short words.
Please see attachment 1; it is the target article to be searched. You can see is and 标准 in the text.
Please see attachment 2: searching for the Chinese word 标准, which is in the text, finds nothing.
Please see attachment 3: searching for the short word is, which is in the text, finds nothing.
How can I solve the problem?
Edit:
Sphinx only builds an index entry for a whole Chinese sentence, since there are no spaces in it and Sphinx doesn't know where to split the words. Check the file searchindex.js for the indexes that were generated.
Try searching for '标准表达方式'; it works. ^_^
Sphinx builds its indexes using a Python script, search.py. Looking into it, we find:
stopwords = set("""
a and are as at
be but by
for
if in into is it
near no not
of on or
such
that the their then there these they this to
was will with
""".split())
That is why those short words cannot be found. You can remove words from this list if you want them to appear in the index.
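To make the effect of the stopword list concrete, here is a minimal sketch. It is a simplified model, not Sphinx's actual indexing code: the function name index_terms is hypothetical, and only the stopword list itself is taken from the snippet above.

```python
import re

# Stopword list copied from the search.py excerpt quoted above.
STOPWORDS = set("""
a and are as at
be but by
for
if in into is it
near no not
of on or
such
that the their then there these they this to
was will with
""".split())

WORD_RE = re.compile(r"\w+")


def index_terms(text, stopwords=STOPWORDS):
    """Return the lowercased words of `text` that are not stopwords.

    Hypothetical helper for illustration; the real indexer does more work.
    """
    words = (w.lower() for w in WORD_RE.findall(text))
    return [w for w in words if w not in stopwords]


text = "This is the standard expression"

# With the default list, "this", "is" and "the" never reach the index.
print(index_terms(text))

# Removing "is" from the list makes it indexable (and therefore searchable).
print(index_terms(text, STOPWORDS - {"is"}))
```

The second call shows why editing the list is enough for the short-word case: a word can only be found if it was kept at indexing time.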
We can also find this line:
word_re = re.compile(r'\w+(?u)')
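This is the regular expression Sphinx uses to split text into words. Because \w+ matches any unbroken run of word characters, a Chinese sentence without spaces comes out as one long "word", which is why individual Chinese words inside it cannot be found.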
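The behaviour of this pattern is easy to check in isolation. The sketch below uses the same \w+ pattern, but passes re.UNICODE as a flag argument instead of the trailing (?u), since recent Python versions reject inline global flags that are not at the start of the pattern. The helper name split_words is made up for this example.

```python
import re

# Same pattern as in search.py, with the flag passed as an argument.
word_re = re.compile(r"\w+", re.UNICODE)


def split_words(text):
    """Split text the way a plain \\w+ tokenizer would."""
    return word_re.findall(text)


# English text is split on spaces, one token per word.
print(split_words("it is a standard"))

# A Chinese sentence has no spaces, so the whole run is a single token.
print(split_words("这是标准表达方式"))
```

Since the only token produced for the Chinese text is the full sentence, a query for 标准 alone has nothing to match against.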
The solution is to add Chinese word-splitting support to this file. Someone has already done it: http://hyry.dip.jp/tech/blog/index.html?id=374
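As a rough idea of what such splitting could look like, here is a naive sketch that emits every Chinese character as its own token and keeps other words whole. This is an assumption for illustration only: it is not the approach from the linked post (which may use a proper word segmenter), it only covers the basic CJK Unified Ideographs range U+4E00..U+9FFF, and the names tokenize and contains_phrase are hypothetical.

```python
import re

# Either one CJK ideograph, or a run of word characters that are not CJK.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def tokenize(text):
    """Split text into tokens: one per Chinese character, one per other word."""
    return TOKEN_RE.findall(text)


def contains_phrase(doc_tokens, query_tokens):
    """Return True if query_tokens occur consecutively in doc_tokens."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


doc = tokenize("Sphinx 标准表达方式 search")
print(doc)

# The two-character word is now findable inside the longer sentence.
print(contains_phrase(doc, tokenize("标准")))
```

Per-character tokens are the simplest option but inflate the index; a dictionary-based segmenter would produce fewer, more meaningful terms.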
Answer for the Sphinx search engine:
I'll leave it here in case others find it useful. Thanks to mzjn for pointing it out.
Sphinx does not support Chinese by default, since it does not recognize the Chinese character set and doesn't know where to split words when building indexes. You need to modify the configuration file to make it index Chinese words.
More specifically, you should modify charset_table, ngram_len, and ngram_chars in sphinx.conf to make it work. You can search for these keywords to find the proper configuration.
However, Sphinx may generate a huge index, since every single Chinese character is treated as a word. So try Coreseek instead if you really want to build an index for Chinese documents.