
Applying "tag" to millions of documents, using bulk/update methods

Our Elasticsearch instance holds about 55,000,000 documents. We have CSV files of user_ids; the biggest has 9M entries. Our documents use user_id as the key, so this is convenient.

I am posting this question to discuss the options and find the best way to get this done, since there are several ways to approach the problem. We need to add a new "label" to each user document that doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".

  1. There is the classic partial update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
  2. There is the bulk request, which performs better but is limited to roughly 1000-5000 documents per call. Knowing when a batch is too large is something we would have to learn as we go.
  3. Then there is the official open issue for the /update_by_query endpoint, which has lots of traffic but no confirmation that it has been implemented in the standard release.
  4. That open issue mentions an update_by_query plugin which should provide better handling, but there are old, still-open issues where users complain of performance and memory problems.
  5. I am not sure if it's doable in Elasticsearch, but I thought I could load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.

So the question remains: what's the best way to do this? If you have done this before, please share your numbers/performance and what you would do differently this time.

Pentium10 asked Oct 17 '14 14:10





2 Answers

While waiting for update by query support, I have opted for:

  1. Use the scan/scroll API to loop over the document IDs you want to tag (related answer).

  2. Use the bulk API to perform partial updates to set the tag on every matching doc.

Additionally, I store the tag data (your CSV) in a separate doc type and query it to tag new docs as they are created, so that I don't have to index first and update afterwards.

Python snippet to illustrate the approach:

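from elasticsearch import helpers

# es, myquery, myindex and tags are assumed to be defined by the caller,
# e.g. an Elasticsearch client, a query body, an index name and ['github'].

def actiongen():
    # Scroll over the matching docs, fetching only their IDs.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        # One partial-update action per matching doc.
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

# Send the actions through the bulk API; stats_only=True returns only counts.
helpers.bulk(es, actiongen(), index=myindex, stats_only=True)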
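Since the question states that the documents are keyed by user_id, the scan step could probably be skipped when the IDs already sit in a CSV. The following is only a minimal sketch under that assumption: it reads IDs from the first CSV column and builds the same kind of update action as the snippet above. The function names, the 'users' index, the 'user' type and the sample data are all hypothetical, and the '_type' key mirrors the snippet above (newer Elasticsearch versions no longer use types).

```python
import csv
import io
from itertools import islice


def read_user_ids(csv_file):
    """Yield the first column of each non-empty CSV row as a user ID."""
    for row in csv.reader(csv_file):
        if row and row[0].strip():
            yield row[0].strip()


def update_actions(user_ids, index, doc_type, tags):
    """Yield one bulk 'update' action per user ID.

    Assumes the document _id equals the user_id, as stated in the question.
    """
    for user_id in user_ids:
        yield {
            '_op_type': 'update',
            '_index': index,
            '_type': doc_type,
            '_id': user_id,
            'doc': {'tags': tags},
        }


def batches(iterable, size):
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Tiny in-memory stand-in for the real CSV file (hypothetical data).
sample_csv = io.StringIO("101\n102\n\n103\n")
actions = update_actions(read_user_ids(sample_csv), 'users', 'user', ['github'])
for batch in batches(actions, 2):
    # In real use, each batch (or the whole generator) would go to helpers.bulk.
    print(len(batch), [a['_id'] for a in batch])
```

Everything here is lazy, so a 9M-row file is never held in memory at once. The explicit batches helper is only there to make the batch size visible while experimenting; helpers.bulk already splits its input into chunks and accepts a chunk_size argument, so the generator could also be passed to it directly.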
Anton answered Sep 30 '22 19:09



Using the aforementioned update-by-query plugin, you would simply call:

curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query": {"filtered": {"filter":{
        "not": {"term": {"label": "github"}}
    }}},
    "script": "ctx._source.label = \"github\""
}'

The update-by-query plugin only accepts a script, not partial documents.
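If the same call has to be issued for several tags, the request body can be generated rather than hand-quoted. This is just a sketch of that idea: it reproduces the shape of the curl body above and assumes the filter and the script are meant to target the same field. The helper name update_by_query_body is made up for illustration, and the script text is only as valid as the one in the curl example.

```python
import json


def update_by_query_body(field, value):
    """Build a request body shaped like the curl example: match documents
    whose `field` is not `value`, then set `field` to `value` via a script."""
    return {
        "query": {"filtered": {"filter": {
            "not": {"term": {field: value}}
        }}},
        # json.dumps(value) quotes and escapes the value for the script text.
        "script": "ctx._source.%s = %s" % (field, json.dumps(value)),
    }


body = update_by_query_body("label", "github")
print(json.dumps(body, indent=4))
```

The printed JSON can be sent as the POST body to the plugin's _update_by_query endpoint, in place of the inline -d string.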

As for performance and memory issues, I guess the best thing is to give it a try.

ofavre answered Sep 30 '22 21:09
