I am studying fulltext search engines for django. It must be simple to install, fast indexing, fast index update, not blocking while indexing, fast search.
After reading many web pages, I put in short list : Mysql MYISAM fulltext, djapian/python-xapian, and django-sphinx I did not choose lucene because it seems complex, nor haystack as it has less features than djapian/django-spĥinx (like fields weighting).
Then I made some benchmarks, to do so, I collected many free books on the net to generate a database table with 1 485 000 records (id,title,body), each record is about 600 bytes long. From the database, I also generated a list of 100 000 existing words and shuffled them to create a search list. For the tests, I made 2 runs on my laptop (4Go RAM, Dual core 2.0Ghz): the first one, just after a server reboot to clear all caches, the second is done juste after in order to test how good are cached results. Here are the "home made" benchmark results :
1485000 records with Title (150 bytes) and body (450 bytes)
Mysql 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 7m14.146s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:11.553524
next run : 0:00:00.168508
Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================
Full indexing : 6m08.154s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:09.553524
next run : 0:00:20.316903
1 thread, 100000 searchs with single word randomly taken from database :
First run : 9m09s
next run : 5m38s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353
1 thread, boolean search : 1000 x (+word1 +word2)
First run : 0:00:21.205404
next run : 0:00:00.145098
Djapian Fulltext :
==========================================================================
Full indexing : 84m7.601s
1 thread, 1000 searchs with single word randomly taken from database with prefetch :
First run : 0:02:28.085680
next run : 0:00:14.300236
python-xapian Fulltext :
==========================================================================
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:26.402084
next run : 0:00:00.695092
django-sphinx Fulltext :
==========================================================================
Full indexing : 1m25.957s
1 thread, 1000 searchs with single word randomly taken from database :
First run : 0:01:30.073001
next run : 0:00:05.203294
1 thread, 100000 searchs with single word randomly taken from database :
First run : 12m48s
next run : 9m45s
1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319
1 thread, boolean search : 1000 x (word1 word2)
First run : 0:00:20.856486
next run : 0:00:03.005416
As you can see, Mysql is not so bad at all for fulltext search. In addition, its query cache is very efficient.
Mysql seems to me a good choice as there is nothing to install (I need just to write a small script to synchronize an Innodb production table to a MyISAM search table) and as I do not really need advanced search feature like stemming etc...
Here is the question : What do you think about Mysql fulltext search engine vs sphinx and xapian ?
I haven't tested Xapian but I did a presentation last year comparing fulltext solutions: http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
Sphinx is the fastest at searches. But it's hard to index data that comes in incrementally, because adding data to an index is about as expensive as creating the whole index from scratch.
So some people maintain two Sphinx indexes: one large index with archived data, and one small index with recent data. Periodically (e.g. weekly) they merge the recent index into the archived index (merging two indexes is less expensive), and truncate the small index to prepare for the new week. This works great for something like a forum, but not as good for a wiki.
You might also check out Apache Solr. This is a wrapper for Lucene, and it makes using Lucene a lot easier and yet more featureful. I didn't know about Solr when I designed that presentation.
The Washington Times is an example of a project that uses Solr together with Django:
If you can get by with MyISAM's fulltext then great. It's certainly handy having it built-in to the database so you can easily and relatively-efficiently perform searches with joins to other tables. And boolean mode search is great.
The downside is it's pretty rudimentary at matching words. Obviously no stemming, but also no special handling of hyphen/apostrophe, and the default min word length and stop list is brutally excessive. (When software thinks “howbeit” is a commonly-typed word, worry!)
Worst: of course it's exclusive to nasty old MyISAM, so it's not going to sit nicely in your InnoDB tables. (You are using InnoDB, right?)
You also might consider SphinxQL which sort of combines the ease of using MySQL's full text capabilities with the power and flexibility of Sphinx.
Installation instructions here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With