Index a MySQL database with Apache Lucene, and keep them synchronized

When a new item is added to MySQL, it must also be indexed by Lucene.
When an existing item is removed from MySQL, it must also be removed from Lucene's index.

The idea is to write a script that will be called every x minutes by a scheduler (e.g. a cron task), to keep MySQL and Lucene synchronized. What I have managed so far:

For each newly added item in MySQL, Lucene indexes it too.
For each item already present in MySQL, Lucene does not index it a second time (no duplicate documents).

This is the part I need help with:

For each previously indexed item that has since been removed from MySQL, Lucene should also remove it from its index.

Here is the code I used, which indexes a MySQL table named tag with two columns, id (primary key) and name:

public static void main(String[] args) throws Exception {

    // Load the MySQL JDBC driver and open a connection to the database.
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");

    // Open a Lucene 3.6 index writer on INDEX_DIR (defined elsewhere).
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

    String query = "SELECT id, name FROM tag";
    Statement statement = connection.createStatement();
    ResultSet result = statement.executeQuery(query);

    while (result.next()) {
        String id = result.getString("id");
        Document document = new Document();
        // The id is stored and left untokenized so it can serve as a unique key.
        document.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
        // updateDocument first deletes any document with the same id, then adds the new one,
        // which is why rows that were already indexed are not duplicated.
        writer.updateDocument(new Term("id", id), document);
    }

    writer.close();

}

PS: this code is for testing purposes only, no need to tell me how awful it is :)

EDIT:

One solution could be to delete all previously added documents and reindex the whole database:

writer.deleteAll();
while (result.next()) {
    Document document = new Document();
    document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(document);
}

I'm not sure this is the most efficient solution. Is it?
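A less drastic alternative to deleting everything might be to compare the ids currently in the index with the ids currently in MySQL, and delete only the ones that have disappeared. The sketch below shows just that comparison in plain Java, under assumptions: the class name StaleIds, the method findStaleIds and the sample ids are hypothetical, and both id sets are assumed to have been collected beforehand (for example from the stored id field and from the SELECT above). It is an illustration of the idea, not a tested Lucene integration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class StaleIds {

    /**
     * Returns the ids that are present in the index but no longer in the database.
     * Both inputs are hypothetical: they are assumed to have been collected beforehand.
     */
    public static Set<String> findStaleIds(Set<String> indexedIds, Set<String> databaseIds) {
        // Copy into a sorted set so the input is not modified and the result is ordered.
        Set<String> stale = new TreeSet<String>(indexedIds);
        // Keep only the ids that MySQL no longer contains.
        stale.removeAll(databaseIds);
        return stale;
    }

    public static void main(String[] args) {
        // Sample data (made up for the example).
        Set<String> indexedIds = new HashSet<String>(Arrays.asList("1", "2", "3", "4"));
        Set<String> databaseIds = new HashSet<String>(Arrays.asList("1", "3", "5"));

        for (String id : findStaleIds(indexedIds, databaseIds)) {
            // With the writer from the question, this is where one would call
            // writer.deleteDocuments(new Term("id", id));
            System.out.println("stale id: " + id); // prints ids 2 and 4
        }
    }
}
```

Whether this beats deleteAll() followed by a full reindex probably depends on the table size and on how many rows are deleted between two runs.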

asked May 31 '12 09:05

sp00m

1 Answer

As long as indexing/reindexing runs separately from your application, you will have synchronization problems. Depending on your field of work, this might not matter, but for many applications with concurrent users it does.

We had the same problem when a job system ran asynchronous indexing every few minutes. Users would find a product through the search engine, and even after an administrator had removed the product from the set of valid products, they still found it in the frontend until the next reindexing job ran. This led to very confusing and rarely reproducible errors being reported to first-level support.

We saw two possibilities: either couple the business logic tightly to updates of the search index, or implement a tighter asynchronous update task. We did the latter.

In the background, a class runs in a dedicated thread inside the Tomcat application, takes updates and runs them in parallel. The delay between a back-office update and its appearance in the frontend is down to 0.5-2 seconds, which greatly reduces the problems for first-level support. It is also as loosely coupled as can be: we could even switch to a different indexing engine.
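A minimal sketch of what such a background updater could look like, using only the JDK. This is not the implementation described above: the names IndexUpdateWorker, Update and SearchIndex are hypothetical, and for simplicity a single consumer thread applies updates one after another instead of running them in parallel. The SearchIndex interface stands in for whatever indexing engine is used, which is where the loose coupling comes from.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IndexUpdateWorker implements Runnable {

    /** Hypothetical update message: which id changed and whether it was deleted. */
    public static final class Update {
        final String id;
        final boolean deleted;

        public Update(String id, boolean deleted) {
            this.id = id;
            this.deleted = deleted;
        }
    }

    /** Hypothetical abstraction over the search engine (Lucene or anything else). */
    public interface SearchIndex {
        void upsert(String id);
        void delete(String id);
    }

    // Sentinel object that tells the worker thread to stop.
    private static final Update STOP = new Update(null, false);

    private final BlockingQueue<Update> queue = new LinkedBlockingQueue<Update>();
    private final SearchIndex index;

    public IndexUpdateWorker(SearchIndex index) {
        this.index = index;
    }

    /** Called by the business logic right after it changes the database. */
    public void submit(Update update) {
        queue.add(update);
    }

    /** Asks the worker to stop once all previously submitted updates are applied. */
    public void shutdown() {
        queue.add(STOP);
    }

    @Override
    public void run() {
        try {
            while (true) {
                Update update = queue.take(); // blocks until an update arrives
                if (update == STOP) {
                    return;
                }
                if (update.deleted) {
                    index.delete(update.id);
                } else {
                    index.upsert(update.id);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt and exit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A fake index that only records what it was asked to do.
        final List<String> log = Collections.synchronizedList(new ArrayList<String>());
        IndexUpdateWorker worker = new IndexUpdateWorker(new SearchIndex() {
            public void upsert(String id) { log.add("upsert " + id); }
            public void delete(String id) { log.add("delete " + id); }
        });

        Thread thread = new Thread(worker, "index-updater");
        thread.start();
        worker.submit(new Update("42", false));
        worker.submit(new Update("7", true));
        worker.shutdown();
        thread.join();

        System.out.println(log); // [upsert 42, delete 7]
    }
}
```

In a web application, the thread would typically be started once at startup and stopped at shutdown, and the business code would call submit(...) whenever it inserts, updates or deletes a row.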

answered Sep 28 '22 05:09

0xCAFEBABE

Tags: java, synchronization, indexing, mysql, lucene
