Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding Lucene leading wildcard performance

Lucene does not by default allow leading wildcards in search terms, but this can be enabled with:

QueryParser#setAllowLeadingWildcard(true)

I understand that use of a leading wildcard prevents Lucene from using the index. Searches with a leading wildcard must scan the entire index.

How do I demonstrate the performance of a leading wildcard query? When is it OK to use setAllowLeadingWildcard(true)?

I have built a test index with 10 million documents in the form:

{ name: random_3_word_phrase }

The index is 360M on disk.

My test queries perform well and I have been unable to actually demonstrate a performance problem. For example, querying for name:*ing produces over 1.1 million documents in less than 1 second. Querying name:*ing* produces over 1.5 million documents in the same time.

What is going here? Why isn't this slow? Is 10,000,000 documents not enough? Do the documents need to contains more than just a single field?

like image 247
Landon Kuhn Avatar asked Aug 01 '12 19:08

Landon Kuhn


2 Answers

Depends on how much memory you have, and how much of the token index is in memory.

A 360MB total index could be searched quite quickly on any old computer. A 360GB index would take a bit longer... ;)

As an example, I fired up an old 2GB index, and searched for "*e".

On a box with 8GB, it returned 500K hits in under 5 seconds. I tried the same index on a box with only 1GB of memory, and it took about 20 seconds.

To illustrate further, here's some generic C# code that basically does a "** E*" type search of 10 million random 3 word phrases.

static string substring = "E";

private static Random random = new Random((int)DateTime.Now.Ticks);//thanks to McAden

private static string RandomString(int size)
{
    StringBuilder builder = new StringBuilder();
    char ch;
    for (int i = 0; i < size; i++)
    {
        ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
        builder.Append(ch);
    }

    return builder.ToString();
}

static void FindSubStringInPhrases()
{
    List<string> index = new List<string>();

    for (int i = 0; i < 10000000; i++)
    {
        index.Add(RandomString(5) + " " + RandomString(5) + " " + RandomString(5));
    }

    var matches = index.FindAll(SubstringPredicate);

}

static bool SubstringPredicate(string item)
{
    if (item.Contains(substring))
        return true;
    else
        return false;
}

After all 10 million phases have been loaded into the list, it still only takes about a second for "var matches = index.FindAll(SubstringPredicate);" to return over 4 million hits.

The point is, memory is fast. Once things can no longer fit into memory and you have to start swapping to disk is when you are going to see performance hits.

like image 94
GalacticJello Avatar answered Oct 05 '22 04:10

GalacticJello


If I understand it correctly, a part of the index is the term dictionary, which is a sorted list of all the indexed terms. When searching with no wildcards or a trailing wildcard, Lucene can take advantage of the fact that many terms have common prefixes. On the other hand, searching with a leading wildcard scans the entire term dictionary. This is not optimal, but the term dictionary tends to be tiny compared to other parts of the index, such as frequency and position data, so a full scan of the term dictionary is usually not big problem by itself.

like image 26
Kai Chan Avatar answered Oct 05 '22 04:10

Kai Chan