
Writing to Lucene index, one document at a time, slows down over time

We have a program, which runs continually, does various things, and changes some records in our database. Those records are indexed using Lucene. So each time we change an entity we do something like:

  1. open db transaction, open Lucene IndexWriter
  2. make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(..) then indexWriter.addDocument(..).
  3. If all went well, commit the db transaction and commit the IndexWriter.
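The flow above can be sketched as follows. `Db` and `Index` are hypothetical in-memory stand-ins, not real JDBC or Lucene types; in the real code, `Index.commit()` corresponds to `IndexWriter.commit()`, the call whose cost grows over time:

```java
import java.util.HashMap;
import java.util.Map;

public class CoupledUpdateSketch {
    // Minimal stand-ins so the sketch runs on its own; NOT real JDBC/Lucene APIs.
    static class Db {
        final Map<String, String> rows = new HashMap<>();
        Map<String, String> pending = new HashMap<>();
        void update(String id, String text) { pending.put(id, text); }
        void commit() { rows.putAll(pending); pending = new HashMap<>(); }
    }
    static class Index {
        final Map<String, String> docs = new HashMap<>();
        void deleteDocuments(String id) { docs.remove(id); }
        void addDocument(String id, String text) { docs.put(id, text); }
        void commit() { /* in real Lucene: sync index files to disk -- the slow part */ }
    }

    // One "transaction": change the db record, mirror it in the index,
    // then commit both -- the pattern described in the question.
    static void updateEntity(Db db, Index index, String id, String text) {
        db.update(id, text);        // change the record in the db transaction
        index.deleteDocuments(id);  // remove the old Lucene document
        index.addDocument(id, text);// add the updated document
        db.commit();                // commit the db transaction
        index.commit();             // commit the IndexWriter (the slow call)
    }

    public static void main(String[] args) {
        Db db = new Db();
        Index index = new Index();
        updateEntity(db, index, "42", "hello");
        System.out.println(db.rows.get("42") + " " + index.docs.get("42"));
    }
}
```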

This works fine, but over time indexWriter.commit() takes more and more. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds, and I don't doubt it would take even longer if the script ran longer.

My solution so far has been to comment out indexWriter.addDocument(..) and indexWriter.commit(), and to recreate the entire index every now and again by first using indexWriter.deleteAll() and then re-adding all documents within one Lucene transaction/IndexWriter (about 250k documents in about 14 seconds). But this obviously abandons the transactional approach offered by databases and Lucene, which keeps the two in sync and keeps database updates visible to users of our tools who search using Lucene.

It seems strange that I can add 250k documents in 14 seconds, but adding 1 document takes 3 seconds. What am I doing wrong, how can I improve the situation?

Adrian Smith asked Aug 28 '15 11:08



1 Answer

What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to a typical relational database, when they really don't. More specifically, in your case a commit syncs all index files to disk, making commit time proportional to index size. That is why your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns:

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine database documentation telling you to avoid doing commits?

Since your main goal seems to be to keep database updates visible through Lucene searches in a timely manner, do the following to improve the situation:

  1. Have indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) trigger after a successful database commit, instead of before
  2. Perform indexWriter.commit() periodically instead of every transaction, just to make sure your changes are eventually written to disk
  3. Use a SearcherManager for searching and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame
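The scheduling idiom behind steps 2 and 3 can be shown in miniature. ScheduledExecutorService is the real JDK API, used the same way below with Lucene's actual IndexWriter.commit() and SearcherManager.maybeRefresh(); `Index` here is a hypothetical in-memory stand-in:

```java
import java.util.Map;
import java.util.concurrent.*;

public class DecoupledCommitSketch {
    static class Index {
        // Stand-in for the real Lucene index; NOT a real Lucene API.
        final Map<String, String> docs = new ConcurrentHashMap<>();
        volatile int commits = 0;
        void update(String id, String text) { docs.put(id, text); } // cheap, per transaction
        void commit() { commits++; } // in real Lucene: the expensive disk sync
    }

    public static void main(String[] args) throws Exception {
        Index index = new Index();
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // Pay the commit cost on a fixed schedule instead of once per transaction.
        ScheduledFuture<?> commitTask = scheduler.scheduleWithFixedDelay(
                index::commit, 100, 100, TimeUnit.MILLISECONDS);

        // Many "transactions": only the cheap update happens inline.
        for (int i = 0; i < 1000; i++) {
            index.update(String.valueOf(i), "text " + i);
        }

        Thread.sleep(350); // let a few scheduled commits run
        commitTask.cancel(false);
        scheduler.shutdown();
        System.out.println("updates=" + index.docs.size() + " commits=" + index.commits);
    }
}
```

The per-transaction work stays constant-time, while the background task bounds how stale the on-disk state can be.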

The following is an example program which demonstrates how document updates can be retrieved by periodically performing maybeRefresh(). It builds an index of 100000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, then repeatedly searches until the update is visible. All resources are properly cleaned up on program termination. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not commit().

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future<?> commitFuture;

        void init() throws IOException {
            indexWriter = new IndexWriter(
                    FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")),
                    new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
            // Commit periodically (here every 5 minutes) instead of after every
            // change, so the cost of syncing index files to disk is paid rarely.
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            // Delete-then-add; IndexWriter.updateDocument(Term, Document) would
            // do the same thing atomically.
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future<?> maybeRefreshFuture;

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            // Near-real-time searchers obtained from the IndexWriter (true =
            // apply deletes): updates become visible on refresh, not on commit.
            searcherManager = new SearcherManager(indexWriter, true, null);
            // Refresh every 5 seconds; this, not commit(), controls visibility.
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

This example was successfully tested using Lucene 5.3.1 and JDK 1.8.0_66.

heenenee answered Sep 19 '22 22:09