I am trying to build my Lucene indexes with multiple threads, so I wrote the following code. First I find the files, and for each file I create a thread to index it. After that I join the threads and optimize the indexes. It works, but I'm not sure... can I trust it at large scale? Is there any way to improve it?
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class mIndexer extends Thread {

    private File ifile;
    private static IndexWriter writer;

    public mIndexer(File f) {
        ifile = f.getAbsoluteFile();
    }

    public static void main(String args[]) throws Exception {
        String indexDir;
        String dataDir;
        if (args.length != 2) {
            dataDir = "/home/omid/Ranking/docs/";
            indexDir = "/home/omid/Ranking/indexes/";
        } else {
            dataDir = args[0];
            indexDir = args[1];
        }

        long start = System.currentTimeMillis();
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
                new StopAnalyzer(Version.LUCENE_34, new File("/home/omid/Desktop/stopwords.txt")),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        int numIndexed = 0;
        try {
            numIndexed = index(dataDir, new TextFilesFilter());
        } finally {
            long end = System.currentTimeMillis();
            System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
            writer.optimize();
            System.out.println("Optimization took " + (System.currentTimeMillis() - end) + " milliseconds");
            writer.close();
        }
        System.out.println("Enjoy your day/night");
    }

    public static int index(String dataDir, FileFilter filter) throws Exception {
        List<Thread> threads = new ArrayList<Thread>();
        File[] dirs = new File(dataDir).listFiles();
        for (File d : dirs) {
            if (d.isDirectory()) {
                for (File f : d.listFiles()) {
                    if (!f.isDirectory() &&
                        !f.isHidden() &&
                        f.exists() &&
                        f.canRead() &&
                        (filter == null || filter.accept(f))) {
                        // Start one thread per file; join them all *after* the loop,
                        // otherwise joining right after start() would serialize the work.
                        Thread t = new mIndexer(f);
                        t.start();
                        threads.add(t);
                    }
                }
            }
        }
        // Wait for every indexing thread to finish before counting documents.
        for (Thread t : threads) {
            t.join();
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument() throws Exception {
        Document doc = new Document();
        if (ifile.exists()) {
            // Index the file contents with term vectors and store the path.
            doc.add(new Field("contents", new FileReader(ifile), Field.TermVector.YES));
            doc.add(new Field("path", ifile.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            // The category is the name of the file's parent directory.
            String cat = ifile.getParentFile().getName();
            //doc.add(new Field("category", cat, Field.Store.YES, Field.Index.NOT_ANALYZED));
            //System.out.println(cat);
        }
        return doc;
    }

    public void run() {
        try {
            System.out.println("Indexing " + ifile.getAbsolutePath());
            // IndexWriter is thread-safe, so sharing the static writer is fine.
            Document doc = getDocument();
            writer.addDocument(doc);
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}
Any help is appreciated.
Why is Lucene fast? Lucene is very fast at searching because of its inverted index technique. Normally, data sources structure data as objects or records, each of which has fields and values.
In a nutshell, Lucene builds an inverted index using Skip-Lists on disk, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST).
Last I checked, Lucene doesn't use B-trees. It uses a variation on skip lists to implement efficient inverted indices.
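To make that concrete, here is a toy sketch of the idea in plain Java. It is not Lucene's actual implementation (Lucene's on-disk postings with skip lists and in-memory FSTs are far more compact), just an illustration of what an inverted index conceptually is: a map from each term to the documents containing it.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: an inverted index maps each term to the list of
// document ids containing it, so a term lookup avoids scanning every document.
public class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new ArrayList<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    public List<Integer> search(String term) {
        return postings.get(term.toLowerCase());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "lucene builds an inverted index");
        index.add(1, "skip lists and finite state transducers");
        System.out.println(index.search("index")); // prints [0]
    }
}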
If you want to parallelize indexing, there are two things you can do:
You are on the right path in parallelizing calls to addDocument, but spawning one thread per document will not scale as the number of documents to index grows. Instead, use a fixed-size ThreadPoolExecutor. Since this task is mainly CPU-intensive (depending on your analyzer and on how you retrieve your data), setting the maximum number of threads to the number of CPUs on your machine is a good starting point.
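As a rough sketch, the per-file threads could be replaced with a fixed-size pool like this; indexFile is a hypothetical placeholder for your getDocument()/addDocument() logic:
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledIndexer {

    // Sketch: index all files with a pool sized to the number of CPUs.
    // IndexWriter is thread-safe, so all tasks can share a single writer.
    static void indexAll(File[] files) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final File f : files) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        indexFile(f); // hypothetical helper, see below
                    } catch (Exception e) {
                        System.out.println(e.toString());
                    }
                }
            });
        }
        pool.shutdown();                          // no new tasks accepted
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for queued tasks to finish
    }

    static void indexFile(File f) throws Exception {
        // placeholder for the question's getDocument() + writer.addDocument(doc) logic
    }
}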
Regarding the merge scheduler, you can increase the maximum number of merge threads with the setMaxThreadCount method of ConcurrentMergeScheduler. Beware that disks are much better at sequential reads/writes than at random reads/writes; as a consequence, setting too high a maximum thread count on your merge scheduler is more likely to slow indexing down than to speed it up.
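For instance, with the 3.x IndexWriterConfig API it might look roughly like this; the thread count of 2 and the analyzer choice are purely illustrative:
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeSchedulerSetup {
    public static IndexWriter openWriter(File indexDir) throws Exception {
        // Keep the merge thread count low: merges are mostly sequential disk I/O.
        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
        scheduler.setMaxThreadCount(2); // illustrative value

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        config.setMergeScheduler(scheduler);

        Directory dir = FSDirectory.open(indexDir);
        return new IndexWriter(dir, config);
    }
}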
But before trying to parallelize your indexing process, you should probably figure out where the bottleneck is. If your disk is too slow, the bottleneck is likely to be the flush and merge steps; in that case, parallelizing calls to addDocument (which essentially consists of analyzing a document and buffering the result of the analysis in memory) will not improve indexing speed at all.
Some side notes:
There is ongoing work in the development version of Lucene to improve indexing parallelism (especially the flushing part; this blog entry explains how it works).
Lucene has a nice wiki page on how to improve indexing speed, where you will find further ideas.