Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fulltext search for django : Mysql not so bad ? (vs sphinx, xapian)

I am studying fulltext search engines for django. It must be simple to install, fast indexing, fast index update, not blocking while indexing, fast search.

After reading many web pages, I put in short list : Mysql MYISAM fulltext, djapian/python-xapian, and django-sphinx I did not choose lucene because it seems complex, nor haystack as it has less features than djapian/django-spĥinx (like fields weighting).

Then I made some benchmarks, to do so, I collected many free books on the net to generate a database table with 1 485 000 records (id,title,body), each record is about 600 bytes long. From the database, I also generated a list of 100 000 existing words and shuffled them to create a search list. For the tests, I made 2 runs on my laptop (4Go RAM, Dual core 2.0Ghz): the first one, just after a server reboot to clear all caches, the second is done juste after in order to test how good are cached results. Here are the "home made" benchmark results :

1485000 records with Title (150 bytes) and body (450 bytes)

Mysql 5.0.75/Ubuntu 9.04 Fulltext :
==========================================================================

Full indexing : 7m14.146s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:11.553524
next run : 0:00:00.168508

Mysql 5.5.4 m3/Ubuntu 9.04 Fulltext :
==========================================================================

Full indexing : 6m08.154s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:09.553524
next run : 0:00:20.316903

1 thread, 100000 searchs with single word randomly taken from database : 
First run : 9m09s
next run : 5m38s

1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:15.007353

1 thread, boolean search : 1000 x (+word1 +word2) 
First run : 0:00:21.205404
next run : 0:00:00.145098

Djapian Fulltext : 
==========================================================================

Full indexing : 84m7.601s

1 thread, 1000 searchs with single word randomly taken from database with prefetch : 
First run : 0:02:28.085680
next run : 0:00:14.300236

python-xapian Fulltext :
==========================================================================

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:26.402084
next run : 0:00:00.695092

django-sphinx Fulltext :
==========================================================================

Full indexing : 1m25.957s

1 thread, 1000 searchs with single word randomly taken from database : 
First run : 0:01:30.073001
next run : 0:00:05.203294

1 thread, 100000 searchs with single word randomly taken from database : 
First run : 12m48s
next run : 9m45s

1 thread, 10000 random strings (random strings should not be found in database) :
just after the 100000 search test : 0:00:23.535319

1 thread, boolean search : 1000 x (word1 word2) 
First run : 0:00:20.856486
next run : 0:00:03.005416

As you can see, Mysql is not so bad at all for fulltext search. In addition, its query cache is very efficient.

Mysql seems to me a good choice as there is nothing to install (I need just to write a small script to synchronize an Innodb production table to a MyISAM search table) and as I do not really need advanced search feature like stemming etc...

Here is the question : What do you think about Mysql fulltext search engine vs sphinx and xapian ?

like image 427
Eric Avatar asked May 05 '10 21:05

Eric


3 Answers

I haven't tested Xapian but I did a presentation last year comparing fulltext solutions: http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql

Sphinx is the fastest at searches. But it's hard to index data that comes in incrementally, because adding data to an index is about as expensive as creating the whole index from scratch.

So some people maintain two Sphinx indexes: one large index with archived data, and one small index with recent data. Periodically (e.g. weekly) they merge the recent index into the archived index (merging two indexes is less expensive), and truncate the small index to prepare for the new week. This works great for something like a forum, but not as good for a wiki.

You might also check out Apache Solr. This is a wrapper for Lucene, and it makes using Lucene a lot easier and yet more featureful. I didn't know about Solr when I designed that presentation.

The Washington Times is an example of a project that uses Solr together with Django:

  • http://www.screeley.com/djangosolr/
  • http://www.chrisumbel.com/article/django_solr
like image 183
Bill Karwin Avatar answered Sep 23 '22 04:09

Bill Karwin


If you can get by with MyISAM's fulltext then great. It's certainly handy having it built-in to the database so you can easily and relatively-efficiently perform searches with joins to other tables. And boolean mode search is great.

The downside is it's pretty rudimentary at matching words. Obviously no stemming, but also no special handling of hyphen/apostrophe, and the default min word length and stop list is brutally excessive. (When software thinks “howbeit” is a commonly-typed word, worry!)

Worst: of course it's exclusive to nasty old MyISAM, so it's not going to sit nicely in your InnoDB tables. (You are using InnoDB, right?)

like image 28
bobince Avatar answered Sep 23 '22 04:09

bobince


You also might consider SphinxQL which sort of combines the ease of using MySQL's full text capabilities with the power and flexibility of Sphinx.

Installation instructions here

like image 34
Ty W Avatar answered Sep 24 '22 04:09

Ty W