
Django Haystack/ElasticSearch indexing process aborted

I'm running a setup with Django 1.4, Haystack 2 beta, and ElasticSearch 0.20. My database is PostgreSQL 9.1, which contains several million records. When I try to index all of my data with Haystack/ElasticSearch, the process is killed partway through and the only output I get is a message that just says "Killed". So far I've noticed the following:

  1. The command does report the number of documents to be indexed, so I'm not hitting an error like "0 documents to index".
  2. Indexing a small set, for example 1,000 records, works just fine.
  3. I've tried hardcoding the timeout in haystack/backends/__init__.py (the settings-level equivalent is sketched after this list), and that has no effect.
  4. I've also tried changing options in elasticsearch.yml, to no avail.
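
For reference, the settings-level equivalent of hardcoding the timeout looks roughly like this (a sketch only; TIMEOUT is in seconds, and I'm assuming the 2.0 beta reads it the same way the base backend does):

    # settings.py
    HAYSTACK_CONNECTIONS = {
        'default': {
            'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
            'URL': 'http://127.0.0.1:9200/',
            'INDEX_NAME': 'haystack',
            'TIMEOUT': 60 * 5,  # raise from the 10-second default
        },
    }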

If hardcoding the timeout doesn't work, how else can I extend the time allowed for indexing? Is there a way to change this directly in ElasticSearch? Or is there some batch-processing method I should be using?

Thanks in advance!

asked Jan 14 '23 by maximus


1 Answer

A bare "Killed" message usually means the kernel's out-of-memory killer terminated the process, not that a timeout fired. I'd venture that the issue is generating the documents to send to ElasticSearch, and that using the --batch-size option will help you out.

The update method in the ElasticSearch backend prepares the documents to index from each provided queryset and then performs a single bulk insert for that queryset:

    self.conn.bulk_index(self.index_name, 'modelresult', prepped_docs, id_field=ID)
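
In condensed form, the surrounding update() does roughly this (a paraphrase for illustration, not the exact Haystack source):

    def update(self, index, iterable, commit=True):
        # Every object in the queryset is turned into a document first,
        # so the entire payload sits in memory at once...
        prepped_docs = [index.full_prepare(obj) for obj in iterable]
        # ...and is then shipped to ElasticSearch in one bulk request.
        self.conn.bulk_index(self.index_name, 'modelresult', prepped_docs, id_field=ID)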

So if you've got a table with millions of records, running update_index on that indexed model means generating millions of documents in memory and then indexing them in one shot; I'd venture that's where the problem is. Setting a batch limit with the --batch-size option should restrict document generation to queryset slices of your batch size, with one bulk insert per slice.
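
For example (1,000 here is illustrative; tune it to whatever fits your memory budget):

    python manage.py update_index --batch-size=1000

Each batch is prepared and bulk-inserted separately, so peak memory use is bounded by the batch size rather than by the full table.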

answered Jan 18 '23 by bennylope