
Writing to Lucene index, one document at a time, slows down over time

We have a program, which runs continually, does various things, and changes some records in our database. Those records are indexed using Lucene. So each time we change an entity we do something like:

  1. open db transaction, open Lucene IndexWriter
  2. make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(..) then indexWriter.addDocument(..).
  3. If all went well, commit the db transaction and commit the IndexWriter.
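The flow above can be sketched as follows. `Db` and `Index` are hypothetical in-memory stand-ins, not real JDBC or Lucene types; in the real code, `Index.commit()` corresponds to `IndexWriter.commit()`, the call whose cost grows over time:

```java
import java.util.HashMap;
import java.util.Map;

public class CoupledUpdateSketch {
    // Minimal stand-ins so the sketch runs on its own; NOT real JDBC/Lucene APIs.
    static class Db {
        final Map<String, String> rows = new HashMap<>();
        Map<String, String> pending = new HashMap<>();
        void update(String id, String text) { pending.put(id, text); }
        void commit() { rows.putAll(pending); pending = new HashMap<>(); }
    }
    static class Index {
        final Map<String, String> docs = new HashMap<>();
        void deleteDocuments(String id) { docs.remove(id); }
        void addDocument(String id, String text) { docs.put(id, text); }
        void commit() { /* in real Lucene: sync index files to disk -- the slow part */ }
    }

    // One "transaction": change the db record, mirror it in the index,
    // then commit both -- the pattern described in the question.
    static void updateEntity(Db db, Index index, String id, String text) {
        db.update(id, text);        // change the record in the db transaction
        index.deleteDocuments(id);  // remove the old Lucene document
        index.addDocument(id, text);// add the updated document
        db.commit();                // commit the db transaction
        index.commit();             // commit the IndexWriter (the slow call)
    }

    public static void main(String[] args) {
        Db db = new Db();
        Index index = new Index();
        updateEntity(db, index, "42", "hello");
        System.out.println(db.rows.get("42") + " " + index.docs.get("42"));
    }
}
```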

This works fine, but over time indexWriter.commit() takes more and more. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds, and I don't doubt it would take even longer if the script ran longer.

My solution so far has been to comment out indexWriter.addDocument(..) and indexWriter.commit(), and to recreate the entire index every now and again by first using indexWriter.deleteAll() and then re-adding all documents within one Lucene transaction/IndexWriter (about 250k documents in about 14 seconds). But this obviously abandons the transactional approach offered by databases and Lucene, which keeps the two in sync and keeps database updates visible to users of our tools who search using Lucene.

It seems strange that I can add 250k documents in 14 seconds, but adding 1 document takes 3 seconds. What am I doing wrong, how can I improve the situation?

Adrian Smith asked Aug 28 '15 11:08



1 Answer

What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to a typical relational database, when they really don't. More specifically, in your case a commit syncs all index files to disk, making commit time proportional to index size. That is why your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns:

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine database documentation telling you to avoid doing commits?

Since your main goal seems to be to keep database updates visible through Lucene searches in a timely manner, do the following to improve the situation:

  1. Have indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) trigger after a successful database commit, instead of before
  2. Perform indexWriter.commit() periodically instead of every transaction, just to make sure your changes are eventually written to disk
  3. Use a SearcherManager for searching and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame
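The scheduling idiom behind steps 2 and 3 can be shown in miniature. ScheduledExecutorService is the real JDK API, used the same way below with Lucene's actual IndexWriter.commit() and SearcherManager.maybeRefresh(); `Index` here is a hypothetical in-memory stand-in:

```java
import java.util.Map;
import java.util.concurrent.*;

public class DecoupledCommitSketch {
    static class Index {
        // Stand-in for the real Lucene index; NOT a real Lucene API.
        final Map<String, String> docs = new ConcurrentHashMap<>();
        volatile int commits = 0;
        void update(String id, String text) { docs.put(id, text); } // cheap, per transaction
        void commit() { commits++; } // in real Lucene: the expensive disk sync
    }

    public static void main(String[] args) throws Exception {
        Index index = new Index();
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // Pay the commit cost on a fixed schedule instead of once per transaction.
        ScheduledFuture<?> commitTask = scheduler.scheduleWithFixedDelay(
                index::commit, 100, 100, TimeUnit.MILLISECONDS);

        // Many "transactions": only the cheap update happens inline.
        for (int i = 0; i < 1000; i++) {
            index.update(String.valueOf(i), "text " + i);
        }

        Thread.sleep(350); // let a few scheduled commits run
        commitTask.cancel(false);
        scheduler.shutdown();
        System.out.println("updates=" + index.docs.size() + " commits=" + index.commits);
    }
}
```

The per-transaction work stays constant-time, while the background task bounds how stale the on-disk state can be.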

The following is an example program which demonstrates how document updates can be retrieved by periodically performing maybeRefresh(). It builds an index of 100000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, then repeatedly searches until the update is visible. All resources are properly cleaned up on program termination. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not commit().

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future<?> commitFuture;

        void init() throws IOException {
            indexWriter = new IndexWriter(
                    FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")),
                    new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
            // Commit periodically (here every 5 minutes) instead of after every
            // change, so the cost of syncing index files to disk is paid rarely.
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            // Delete-then-add; IndexWriter.updateDocument(Term, Document) would
            // do the same thing atomically.
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future<?> maybeRefreshFuture;

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            // Near-real-time searchers obtained from the IndexWriter (true =
            // apply deletes): updates become visible on refresh, not on commit.
            searcherManager = new SearcherManager(indexWriter, true, null);
            // Refresh every 5 seconds; this, not commit(), controls visibility.
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

This example was successfully tested using Lucene 5.3.1 and JDK 1.8.0_66.

heenenee answered Sep 19 '22 22:09