 

Search large amounts of constantly updated text in MySQL

I have a MySQL database to which large amounts of text are constantly added (about 10 pages of text per hour). The text is stored as plain text in TEXT fields, and every row contains a page or two of text.

I need to do full-text search (search for keywords in the text and run complex queries) on this database on a regular basis. I only need to search the newly added text, but it is very important that added text becomes searchable almost immediately (within a minute or two).

From what I've read, MySQL's built-in full-text search is quite inefficient. I know Lucene is an option, but I'm not sure yet how quickly it can index new text.
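For reference, the built-in MySQL approach under discussion looks roughly like this. This is a minimal sketch with a hypothetical `articles` table; note that FULLTEXT indexes historically required the MyISAM engine (InnoDB only gained FULLTEXT support in MySQL 5.6):

```sql
-- Hypothetical table with a FULLTEXT index on the text column
CREATE TABLE articles (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- Simple natural-language keyword search
SELECT id FROM articles
WHERE MATCH(body) AGAINST ('keyword');

-- Boolean mode supports more complex queries
-- (+required, -excluded, "exact phrase")
SELECT id FROM articles
WHERE MATCH(body) AGAINST ('+mysql -oracle "full text"' IN BOOLEAN MODE);
```

The index is updated on every INSERT, which is part of why this approach slows down as the table grows.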

So what are my options? Is there a way to make MySQL more efficient? Is Lucene my best solution? Is there anything more appropriate?

Thanks

asked Oct 10 '22 by applechief

1 Answer

I have benchmarked indexing times for Sphinx and Solr. Sphinx is far ahead of Solr in this respect: its indexing algorithms give super-fast indexing times and a small index size.

At 10 pages of text per hour, it seems you don't even need Sphinx's real-time indexing. You can follow the main + delta indexing scheme (described in the Sphinx documentation). It is very fast and near real time. If you want more help on this, please feel free to ask; I'd be glad to explain.

Solr is great, but when it comes to optimized indexing algorithms, Sphinx rocks! Try Sphinx.

Coming to your question in the comment: Solr/Lucene supports incremental indexing (known as "delta imports" in their terminology), and it's quite easy to configure, but it is pretty slow compared to the method used by Sphinx.

Main + delta is quick enough because you can create a separate table, store your new text in it, and index only that. According to the documentation: Sphinx supports "live" (almost real-time) index updates, which can be implemented using the so-called "main+delta" scheme. The idea is to set up two sources and two indexes: one "main" index for the bulk of the data, and one "delta" index for the new documents.
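A configuration for the scheme described above might look like the following sketch. The table and column names (`documents`, `sph_counter`, `id`, `body`) and the connection details are placeholders; the counter table records the highest document id covered by the main index, so the delta source only picks up rows added after the last full rebuild:

```ini
source main
{
    type     = mysql
    sql_host = localhost
    sql_user = searchuser
    sql_pass = secret
    sql_db   = mydb

    # Remember the highest id indexed so far
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query     = SELECT id, body FROM documents \
                    WHERE id <= (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

source delta : main
{
    # The delta only covers rows added since the main index was built
    sql_query_pre =
    sql_query     = SELECT id, body FROM documents \
                    WHERE id > (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

index main
{
    source = main
    path   = /var/data/sphinx/main
}

index delta : main
{
    source = delta
    path   = /var/data/sphinx/delta
}
```

Searches are then run against both indexes, and searchd merges the result sets.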

Say, for example, you have 10 million records: keep those as the main index, and add all new documents to a separate table that acts as the delta source. This delta table can be re-indexed from time to time (say every hour), and since you're only adding 10 pages of text, the new data becomes searchable within a few seconds. Once your new records are searchable, you can merge the delta index into the main index; the merge can be carried out without interfering with searches. After the merge, empty the delta table, and an hour later carry out the whole process again. I hope that makes sense; feel free to ask if you have any questions.
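Operationally, the cycle above can be driven from cron with the `indexer` tool. This is a sketch that assumes the index names `main` and `delta` from a main+delta setup; `--rotate` tells the running searchd to swap in the freshly built index without downtime:

```shell
# Re-index only the delta (cheap; run every few minutes from cron)
indexer delta --rotate

# Periodically fold the delta into the main index,
# then rebuild the now-empty delta
indexer --merge main delta --rotate
indexer delta --rotate
```

Emptying the delta source table between cycles is left to your application or a scheduled SQL job.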

answered Oct 18 '22 by Yavar