Applying "tag" to millions of documents, using bulk/update methods

Tags:

elasticsearch

We have about 55,000,000 documents in our Elasticsearch instance. We also have CSV files of user_ids; the biggest one has 9M entries. Our documents use user_id as the key, which is convenient.

I am posting this question to discuss the options and find the best way to get this done, as there are several ways to approach the problem. We need to add a new "label" to a user's document if the document doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".

There is the classic partial update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
There is the bulk request, which performs better but is limited to roughly 1000-5000 documents per call. Knowing when a batch is too large is something we would have to learn as we go.
Then there is the official open issue for an /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in a standard release.
That open issue mentions an update_by_query plugin, which should provide better handling, but there are old, still-open issues where users complain of performance and memory problems.
I am not sure it's doable in Elasticsearch, but I thought I could load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.

So the question remains: what's the best way to do this? If you have done this in the past, please share your numbers/performance and what you would do differently this time.


asked Oct 17 '14 14:10

Pentium10

2 Answers

While waiting for update by query support, I have opted for:

Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.

Additionally, I store the tag data (your CSV) in a separate doc type and query it to tag new docs as they are created, so that I don't have to index first and update afterwards.

Python snippet to illustrate the approach:

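from elasticsearch import helpers

# es is an Elasticsearch client instance; myquery, myindex and tags are
# placeholders for the query selecting the docs to tag, the index name,
# and the tag value(s) to set.
def actiongen():
    # Scan/scroll over the matching docs, fetching only their IDs.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        # One partial-update action per doc. Note that a partial 'doc'
        # update replaces the whole 'tags' field rather than appending.
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

# helpers.bulk splits the generated actions into batches for us.
helpers.bulk(es, actiongen(), index=myindex, stats_only=True)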
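
Since the documents are keyed by user_id, the scan step can in principle be skipped and the actions built straight from the CSV. The following is only a rough sketch under several assumptions: a recent Elasticsearch with Painless scripting, a `tags` field that is an array, a one-column CSV, and an index called `users`. The script text, the helper names and the batch size are all illustrative, not something taken from the question.

```python
import csv
import io
from itertools import islice

# Assumed Painless script: append params.tag only when it is missing.
ADD_TAG_SCRIPT = (
    "if (ctx._source.tags == null) { ctx._source.tags = [params.tag] } "
    "else if (!ctx._source.tags.contains(params.tag)) "
    "{ ctx._source.tags.add(params.tag) }"
)


def read_user_ids(csv_file):
    """Yield the first column of every non-empty CSV row as a user_id."""
    for row in csv.reader(csv_file):
        if row and row[0].strip():
            yield row[0].strip()


def tag_actions(user_ids, index, tag):
    """Yield one scripted bulk 'update' action per user_id."""
    for user_id in user_ids:
        yield {
            "_op_type": "update",
            "_index": index,
            "_id": user_id,
            "script": {
                "source": ADD_TAG_SCRIPT,
                "lang": "painless",
                "params": {"tag": tag},
            },
        }


def batched(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Tiny in-memory CSV standing in for the real 9M-line file.
sample = io.StringIO("u1\nu2\n\nu3\n")
batches = list(batched(tag_actions(read_user_ids(sample), "users", "github"), 2))
```

The generator from `tag_actions` can be handed to `helpers.bulk(es, ...)` directly, since that helper already chunks its input; `batched` is only there if you want to control or time each batch yourself while finding a batch size that works.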


answered Sep 30 '22 19:09

Anton

Using the aforementioned update-by-query plugin, you would simply call:

curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query": {"filtered": {"filter":{
        "not": {"term": {"label": "github"}}
    }}},
    "script": "ctx._source.label = \"github\""
}'

The update-by-query plugin only accepts a script, not partial documents.

As for performance and memory issues, I guess the best thing is to give it a try.
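
Later Elasticsearch releases ship `_update_by_query` as a built-in endpoint, so the plugin is not needed there. Below is a hedged sketch of how the request body could be built for one batch of user IDs, assuming the document `_id` is the user_id, the field is called `label` as in the question, and Painless scripting is available. The function name and the sample IDs are made up for illustration.

```python
def build_update_by_query(user_ids, label):
    """Build an update-by-query body that labels the given users
    unless their document already carries that label."""
    return {
        "query": {
            "bool": {
                # Restrict the update to this batch of document IDs.
                "filter": [{"ids": {"values": list(user_ids)}}],
                # Skip documents that already have the label.
                "must_not": [{"term": {"label": label}}],
            }
        },
        "script": {
            "source": "ctx._source.label = params.label",
            "lang": "painless",
            "params": {"label": label},
        },
    }


body = build_update_by_query(["u1", "u2"], "github")
```

With the official Python client this body would be sent with something like `es.update_by_query(index="users", body=body)`, though the exact call shape depends on the client version. The IDs from the CSV would still have to be sent in batches, since a single query listing 9M IDs is unlikely to be accepted.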

answered Sep 30 '22 21:09

ofavre
