
Lucene IndexWriter slow to add documents

I wrote a small loop that added 10,000 documents through the IndexWriter, and it took forever to do it.

Is there another way to index large volumes of documents?

I ask because when this goes live it has to load in 15,000 records.

The other question is how do I prevent having to load in all the records again when the web application is restarted?

Edit

Here is the code I used:

for (int t = 0; t < 10000; t++)
{
    doc = new Document();
    text = "Value" + t.ToString();
    doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
    iwriter.AddDocument(doc);
}

Edit 2

        Analyzer analyzer = new StandardAnalyzer();
        Directory directory = new RAMDirectory();

        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);

        iwriter.SetMaxFieldLength(25000);

then the code to add the documents, then:

        iwriter.Close();
griegs, asked Jul 21 '10

People also ask

What is IndexWriter?

IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is done.
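As a hedged sketch of what specifying a policy looks like: later Lucene.NET releases (roughly 2.4+; the exact overload depends on your version) accept an IndexDeletionPolicy in the IndexWriter constructor. This just makes the default explicit; a custom policy, or SnapshotDeletionPolicy wrapping it, is where it becomes useful (e.g. hot backups).

```csharp
// Sketch, assuming a Lucene.NET version whose IndexWriter constructor
// takes an IndexDeletionPolicy (2.4+). KeepOnlyLastCommitDeletionPolicy
// is the default, so this writer behaves exactly like one built without it.
Directory dir = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer();
IndexDeletionPolicy policy = new KeepOnlyLastCommitDeletionPolicy();

IndexWriter writer = new IndexWriter(dir, analyzer, policy,
                                     IndexWriter.MaxFieldLength.UNLIMITED);
```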

How does Lucene build an index?

In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST).
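To make "inverted index" concrete, here is a toy illustration (not Lucene's actual on-disk structures): an inverted index maps each term to the list of document ids that contain it, which is what makes term lookup fast.

```csharp
// Toy inverted index: term -> postings list of document ids.
// This is purely illustrative; Lucene's real index adds skip lists,
// compression, and an FST over the term dictionary.
using System;
using System.Collections.Generic;

class InvertedIndexDemo
{
    static void Main()
    {
        string[] docs = { "the quick fox", "the lazy dog", "quick brown dog" };
        var index = new Dictionary<string, List<int>>();

        // "Indexing": record, for every term, which documents it occurs in.
        for (int id = 0; id < docs.Length; id++)
        {
            foreach (string term in docs[id].Split(' '))
            {
                List<int> postings;
                if (!index.TryGetValue(term, out postings))
                    index[term] = postings = new List<int>();
                postings.Add(id);
            }
        }

        // "Searching": one dictionary lookup instead of scanning every document.
        Console.WriteLine(string.Join(",", index["dog"])); // prints "1,2"
    }
}
```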


2 Answers

You should do it this way to get the best performance; on my machine I'm indexing 1,000 documents per second.

1) You should reuse the Document and Field instances rather than creating new ones every time you add a document, like this:

private static void IndexingThread(object contextObj)
{
    Range<int> range = (Range<int>)contextObj;

    // Create the Document and its Fields once, outside the loop.
    Document newDoc = new Document();
    newDoc.Add(new Field("title", "", Field.Store.NO, Field.Index.ANALYZED));
    newDoc.Add(new Field("body", "", Field.Store.NO, Field.Index.ANALYZED));
    newDoc.Add(new Field("newsdate", "", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    newDoc.Add(new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

    for (int counter = range.Start; counter <= range.End; counter++)
    {
        // Only swap in the new values, then re-add the same document instance.
        newDoc.GetField("title").SetValue(Entities[counter].Title);
        newDoc.GetField("body").SetValue(Entities[counter].Body);
        newDoc.GetField("newsdate").SetValue(Entities[counter].NewsDate);
        newDoc.GetField("id").SetValue(Entities[counter].ID.ToString());

        writer.AddDocument(newDoc);
    }
}

After that you can use threading: break your large collection into smaller sections and run the code above on each one. For example, if you have 10,000 documents you can create 10 threads via the ThreadPool and feed one section to each thread for indexing.

Then you will get the best performance.
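A hedged sketch of that partitioning, building on the IndexingThread method above. Range&lt;int&gt; and its (start, end) constructor are the answer's own helper and are assumed here; IndexWriter.AddDocument is thread-safe, so all threads can share the one writer.

```csharp
// Split 10,000 documents into 10 ranges and index each on a ThreadPool thread.
// Assumes IndexingThread, writer, and Range<int> from the answer above.
const int Total = 10000, Threads = 10;
int chunk = Total / Threads;
var done = new ManualResetEvent[Threads];

for (int i = 0; i < Threads; i++)
{
    int start = i * chunk;
    int end = (i == Threads - 1) ? Total - 1 : start + chunk - 1;
    ManualResetEvent finished = done[i] = new ManualResetEvent(false);

    ThreadPool.QueueUserWorkItem(delegate
    {
        IndexingThread(new Range<int>(start, end)); // the method shown above
        finished.Set();
    });
}

WaitHandle.WaitAll(done); // block until every section has been indexed
writer.Close();
```

Note that start, end, and finished are declared inside the loop body so each queued delegate captures its own copies; capturing the loop variable i directly is the classic closure bug here.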

Ehsan, answered Nov 04 '22


Just checking, but you haven't got the debugger attached when you're running it have you?

This severely affects performance when adding documents.

On my machine (Lucene 2.0.0.4):

Built with platform target x86:

  • No debugger - 5.2 seconds

  • Debugger attached - 113.8 seconds

Built with platform target x64:

  • No debugger - 6.0 seconds

  • Debugger attached - 171.4 seconds

Rough example of saving and loading an index to and from a RAMDirectory:

const int DocumentCount = 10 * 1000;
const string IndexFilePath = @"X:\Temp\tmp.idx";

Analyzer analyzer = new StandardAnalyzer();
Directory ramDirectory = new RAMDirectory();

IndexWriter indexWriter = new IndexWriter(ramDirectory, analyzer, true);

for (int i = 0; i < DocumentCount; i++)
{
    Document doc = new Document();
    string text = "Value" + i;
    doc.Add(new Field("Value", text, Field.Store.YES, Field.Index.TOKENIZED));
    indexWriter.AddDocument(doc);
}

indexWriter.Close();

//Save index
FSDirectory fileDirectory = FSDirectory.GetDirectory(IndexFilePath, true);
IndexWriter fileIndexWriter = new IndexWriter(fileDirectory, analyzer, true);
fileIndexWriter.AddIndexes(new[] { ramDirectory });
fileIndexWriter.Close();

//Load index
FSDirectory newFileDirectory = FSDirectory.GetDirectory(IndexFilePath, false);
Directory newRamDirectory = new RAMDirectory();
IndexWriter newIndexWriter = new IndexWriter(newRamDirectory, analyzer, true);
newIndexWriter.AddIndexes(new[] { newFileDirectory });

Console.WriteLine("New index writer document count:{0}.", newIndexWriter.DocCount());
newIndexWriter.Close();
Tim Lloyd, answered Nov 04 '22