The idea is to write a script that is called every x minutes by a scheduler (e.g. a cron job) to keep MySQL and Lucene synchronized. What I have managed so far:
This is the point I need help with:
Here is the code I used, which tries to index a MySQL table tag (id [PK] | name):
public static void main(String[] args) throws Exception {
    // Load the MySQL JDBC driver and connect to the database
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "root", "");

    // Open (or create) the Lucene index; INDEX_DIR is a java.io.File defined elsewhere
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), config);

    String query = "SELECT id, name FROM tag";
    Statement statement = connection.createStatement();
    ResultSet result = statement.executeQuery(query);

    while (result.next()) {
        Document document = new Document();
        document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
        // Replaces any existing document with the same id, or adds it if none exists
        writer.updateDocument(new Term("id", result.getString("id")), document);
    }

    writer.close();
}
PS: this code is for testing purposes only, no need to tell me how awful it is :)
EDIT:
One solution could be to delete every previously added document and reindex the whole database:
writer.deleteAll();
while (result.next()) {
    Document document = new Document();
    document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("name", result.getString("name"), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(document);
}
I'm not sure this is the most efficient solution, is it?
Why is Lucene faster? Lucene is very fast at searching for data because of its inverted index technique. Normally, data sources structure data as objects or records, which in turn have fields and values. An inverted index instead maps each term to the documents that contain it.
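To make the inverted index idea concrete, here is a toy sketch in plain Java. It is not how Lucene is implemented internally; the class name TinyInvertedIndex and its methods are made up for illustration, and tokenization is reduced to lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, for illustration): term -> ids of documents containing it. */
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits it on non-alphanumerics, and records docId under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** A lookup is a single map access, independent of how many documents were indexed. */
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Java search library");
        index.add(2, "MySQL database");
        index.add(3, "Search the database");
        System.out.println(index.search("search")); // prints [1, 3]
    }
}
```

A record-oriented store would have to scan every row's text to answer the same question, which is roughly why the term-to-documents layout pays off for full-text search.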
Lucene is not a database; it's just a Java library.
As long as you let the indexing/reindexing run separately from your application, you will have synchronization problems. Depending on your field of work, this might not be a problem, but for many applications with concurrent users it is.
We had the same problems when we had a job system running asynchronous indexing every few minutes. Users would find a product through the search engine and, even after an administrator had removed it from the set of valid products, would still find it in the frontend until the next reindexing job ran. This led to very confusing and rarely reproducible errors being reported to first-level support.
We saw two possibilities: either couple the business logic tightly to updates of the search index, or implement a tighter asynchronous update task. We did the latter.
In the background, there's a class running in a dedicated thread inside the Tomcat application that takes updates and runs them in parallel. The delay between a backoffice update and its appearance in the frontend is down to 0.5-2 seconds, which greatly reduces the problems for first-level support. It is also as loosely coupled as can be; we could even swap in a different indexing engine.
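For what it's worth, the general shape of such an update task can be sketched roughly like this. This is not our actual code: the names IndexUpdate and BackgroundIndexer are made up for the example, it uses a single worker thread instead of running updates in parallel, and the sink is just a callback. With Lucene, the sink would presumably call IndexWriter.updateDocument or IndexWriter.deleteDocuments for each entry and then commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical change event; a null name stands for "remove this id from the index". */
final class IndexUpdate {
    final String id;
    final String name;
    IndexUpdate(String id, String name) { this.id = id; this.name = name; }
}

/** Queues updates from the business logic and applies them in batches on a dedicated thread. */
final class BackgroundIndexer implements AutoCloseable {
    private final BlockingQueue<IndexUpdate> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean running = true;

    /** sink is an assumed callback that writes one batch to the search index. */
    BackgroundIndexer(Consumer<List<IndexUpdate>> sink) {
        worker = new Thread(() -> {
            try {
                // Keep going until close() was called and the queue has been drained
                while (running || !queue.isEmpty()) {
                    IndexUpdate first = queue.poll(200, TimeUnit.MILLISECONDS);
                    if (first == null) continue;
                    List<IndexUpdate> batch = new ArrayList<>();
                    batch.add(first);
                    queue.drainTo(batch); // pick up everything else that is already waiting
                    sink.accept(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "index-updater");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called by the business logic right after it changes a record; returns immediately. */
    void submit(IndexUpdate update) { queue.add(update); }

    /** Stops accepting the loop condition and waits for pending updates to be applied. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BackgroundIndexer indexer = new BackgroundIndexer(batch -> {
            for (IndexUpdate u : batch) {
                System.out.println(u.name == null ? "delete " + u.id : "update " + u.id + " -> " + u.name);
            }
        });
        indexer.submit(new IndexUpdate("1", "java"));
        indexer.submit(new IndexUpdate("2", null));
        indexer.close();
    }
}
```

The business logic only knows about submit(), so the indexing engine behind the sink can be replaced without touching it, and the delay is bounded by how fast the worker drains the queue rather than by a scheduler interval.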