Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Index a MySQL database with Apache Lucene, and keep them synchronized

  1. When a new item is added in MySQL, it must be also indexed by Lucene.
  2. When an existing item is removed from MySQL, it must be also removed from Lucene's index.

The idea is to write a script that will be called every x minutes via a scheduler (e.g. a CRON task). This is a way to keep MySQL and Lucene synchronized. What I managed until yet:

  1. For each new added item in MySQL, Lucene indexes it too.
  2. For each already added item in MySQL, Lucene does not reindex it (no duplicated items).

This is the point I'm asking you some help to manage:

  1. For each previously added item that has been then removed from MySQL, Lucene should also unindex it.

Here is the code I used, which tries to index a MySQL table tag (id [PK] | name):

public static void main(String[] args) throws Exception {

    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

    String query = "SELECT id, name FROM tag";
    Statement statement = connection.createStatement();
    ResultSet result = statement.executeQuery(query);

    while (result.next()) {
        Document document = new Document();
        document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
        writer.updateDocument(new Term("id", result.getString("id")), document);
    }

    writer.close();

}

PS: this code is for tests purpose only, no need to tell me how awful it is :)

EDIT:

One solution could be to delete any previsouly added document, and reindex all the database:

writer.deleteAll();
while (result.next()) {
    Document document = new Document();
    document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(document);
}

I'm not sure it's the most optimized solution, is it?

like image 481
sp00m Avatar asked May 31 '12 09:05

sp00m


People also ask

Why is Lucene so fast?

Why is Lucene faster? Lucene is very fast at searching for data because of its inverted index technique. Normally, datasources structure the data as an object or record, which in turn have fields and values.

Does Lucene use a database?

Lucene is not a database — as I mentioned earlier, it's just a Java library.


1 Answers

As long as you let the indexing/reindexing run separately from your application, you will have synchronization problems. Depending on your field of work, this might not be a problem, but for many concurrent-user-applications it is.

We had the same problems when we had a job system running asynchronous indexing every few minutes. Users would find a product using the search engine, then even when an administrative person removed the product from the valid product stack, still found it in the frontend, until the next reindexing job ran. This leads to very confusing and seldomly reproducable errors reported to first level support.

We saw two possibilities: Either connect the business logic tightly to updates of the search index, or implement a tighter asynchronous update task. We did the latter.

In the background, there's a class running in a dedicated thread inside the tomcat application that takes updates and runs them in parallel. The waiting times for backoffice updates to frontend are down to 0.5-2 seconds, which greatly reduces the problems for first level support. And, it is as loosely coupled as can be, we could even implement a different indexing engine.

like image 137
0xCAFEBABE Avatar answered Sep 28 '22 05:09

0xCAFEBABE