 

Why is whoosh commit so slow

Tags: python, whoosh

I wonder why whoosh is so slow with the following code. The commit in particular takes quite a long time.

I tried limitmb=2048 for the writer instead of the default 128, but it makes almost no difference. As suggested elsewhere, I also tried procs=3 for the writer, which makes the indexing a little faster but the commit even slower. commit(merge=False) doesn't help here either, since the index starts out empty and there is nothing to merge.
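
Concretely, the variants I tried looked like this (only the writer/commit lines differ from the run() below):

writer = ix.writer(limitmb=2048)   # bigger write buffer, default is 128 MB
writer = ix.writer(procs=3)        # parallel indexing: faster indexing, slower commit
writer.commit(merge=False)         # no effect here, the index starts out empty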

I get results like this:

index_documents 12.41 seconds
commit 22.79 seconds
run 35.34 seconds

For such a small schema and roughly 45000 documents (3 roots × 15000 each), that seems like a lot.

I tested with whoosh 2.5.7 and Python 2.7.

Is that normal and I just expect too much, or am I doing something wrong?

I also profiled a little, and it looks like whoosh is writing out and then reading back in lots of pickles; this seems to be related to how the transactions are handled.
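
For reference, this is roughly how I profiled it (plain cProfile, nothing whoosh-specific, running the run() from the script below):

import cProfile
import pstats

# profile the whole run and dump the stats to a file
cProfile.run('run()', 'whoosh.profile')
pstats.Stats('whoosh.profile').sort_stats('cumulative').print_stats(20)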

from contextlib import contextmanager
from whoosh import fields
from whoosh.analysis import NgramWordAnalyzer
from whoosh.index import create_in
import functools
import itertools
import tempfile
import shutil
import time


def timecall(f):
    @functools.wraps(f)
    def wrapper(*args, **kw):
        start = time.time()
        result = f(*args, **kw)
        end = time.time()
        print "%s %.2f seconds" % (f.__name__, end - start)
        return result
    return wrapper


def schema():
    return fields.Schema(
        path=fields.ID(stored=True, unique=True),
        text=fields.TEXT(analyzer=NgramWordAnalyzer(2, 4), stored=False, phrase=False))


@contextmanager
def create_index():
    directory = tempfile.mkdtemp()
    try:
        yield create_in(directory, schema())
    finally:
        shutil.rmtree(directory)


def iter_documents():
    for root in ('egg', 'ham', 'spam'):
        for i in range(1000, 16000):
            yield {
                u"path": u"/%s/%s" % (root, i),
                u"text": u"%s %s" % (root, i)}


@timecall
def index_documents(writer):
    start = time.time()
    counter = itertools.count()
    for doc in iter_documents():
        count = next(counter)
        current = time.time()
        if (current - start) > 1:
            print count
            start = current
        writer.add_document(**doc)


@timecall
def commit(writer):
    writer.commit()


@timecall
def run():
    with create_index() as ix:
        writer = ix.writer()
        index_documents(writer)
        commit(writer)


if __name__ == '__main__':
    run()
asked Jun 17 '14 by fschulze


1 Answer

There is some sort of merging of segments happening in the commit; this also explains why procs=3 makes the commit even slower (more segments to merge!).

For me the solution was to set multisegment=True, as suggested here.

writer = ix.writer(procs=4, limitmb=256, multisegment=True)

You can adjust procs and limitmb as you wish, but keep in mind that limitmb is per process, i.e. the total memory use gets multiplied by procs!
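
Applied to the run() from the question, that looks like this (the parameter values are just the ones I used; 4 procs × 256 MB means roughly 1 GB of write buffer in total):

@timecall
def run():
    with create_index() as ix:
        # 4 worker processes with 256 MB buffer each (~1 GB total),
        # keeping the resulting segments instead of merging them on commit
        writer = ix.writer(procs=4, limitmb=256, multisegment=True)
        index_documents(writer)
        commit(writer)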


Caveat: there is a penalty in search speed. For example:

  • 10000 documents: ~200 ms (w/o multisegment) vs ~1.1 s (with multisegment)

  • 50000 documents: ~60 ms (w/o multisegment) vs ~100 ms (with multisegment)

On my system the commit alone got very roughly 40% faster. I didn't measure indexing times separately, but overall multisegment is also way faster.
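
For reference, a minimal sketch of how such search timings can be taken (the search helper below is illustrative, not the exact code I used):

from whoosh.qparser import QueryParser

@timecall
def search(ix, text):
    # parse against the "text" field and return the stored paths of the top hits
    with ix.searcher() as searcher:
        query = QueryParser("text", ix.schema).parse(text)
        return [hit["path"] for hit in searcher.search(query, limit=10)]

# e.g. search(ix, u"spam 1234")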

This can be the solution while prototyping: once you know you have the desired schema and parameters, you can set multisegment back to False and run the indexing again.
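
For example, with a hypothetical PROTOTYPING switch:

PROTOTYPING = True  # hypothetical flag: fast commits while iterating on the schema

def make_writer(ix):
    if PROTOTYPING:
        return ix.writer(procs=4, limitmb=256, multisegment=True)
    # final run: merged segments, better search speed
    return ix.writer(limitmb=256)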

(The numbers above are just meant to give a rough idea of the trade-off.)

answered Nov 06 '22 by toto_tico