We have about 55,000,000 documents in our Elasticsearch instance. We also have CSV files with user_ids; the biggest CSV has 9M entries. Our documents use user_id as the key, which is convenient.
I am posting this question because there are several ways to address the problem, and I would like to discuss them and find the best one. We need to add a new "label" to each user document that does not have it yet, e.g. tagging the user with "stackoverflow" or "github".
The update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
The bulk request, which performs better but is limited to roughly 1000-5000 documents per call. Knowing when a batch is too large is know-how we would have to learn as we go.
The /update_by_query endpoint, which has attracted a lot of traffic, but there is no confirmation that it was implemented in the standard release.
So the question remains: what is the best way to do this? If you have done this in the past, please share your numbers/performance and what you would do differently this time.
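To make the bulk option more concrete, here is a rough sketch of turning the CSV into batches of partial-update actions. It assumes one user_id per row in the first column; the index name 'users', the field name 'label' and the helper names are made up for illustration, and older Elasticsearch versions would also need a '_type' key in each action.

```python
import csv
import io
from itertools import islice


def read_user_ids(csv_file):
    """Yield user_ids from a CSV file object (first column of each row)."""
    for row in csv.reader(csv_file):
        if row and row[0].strip():
            yield row[0].strip()


def update_actions(user_ids, index, label):
    """Yield one partial-update action per user_id, in the dict format
    accepted by elasticsearch.helpers.bulk."""
    for user_id in user_ids:
        yield {
            '_op_type': 'update',
            '_index': index,      # 'users' below is a placeholder name
            '_id': user_id,
            'doc': {'label': label},
        }


def batched(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for the real 9M-row CSV.
sample_csv = io.StringIO("u1\nu2\nu3\nu4\nu5\n")
actions = update_actions(read_user_ids(sample_csv), 'users', 'github')
batches = list(batched(actions, 2))
print([len(b) for b in batches])
```

Everything is a generator, so the full CSV never has to sit in memory. Note that helpers.bulk in the Python client already splits its input using a chunk_size argument, so explicit batching like this is mainly useful when each batch is sent by hand.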
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally I store the tag data (your CSV) in a separate doc type, and query from that and tag all new docs as they are created, i.e., to not have to first index and then update.
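The last point, tagging new documents as they are created, might look roughly like the sketch below. A plain dict stands in for the separate doc type that holds the tag data; in practice that lookup would be a query against Elasticsearch. All names and sample values here are hypothetical.

```python
# Stand-in for the separate doc type holding the CSV data:
# maps user_id -> tags that should be applied to that user.
tag_store = {
    'u1': ['github'],
    'u2': ['github', 'stackoverflow'],
}


def with_tags(user_doc, store):
    """Return a copy of user_doc with any stored tags merged in,
    so it can be indexed once instead of indexed and then updated."""
    tagged = dict(user_doc)
    current = list(tagged.get('tags', []))
    for tag in store.get(tagged['user_id'], []):
        if tag not in current:
            current.append(tag)
    tagged['tags'] = current
    return tagged


new_doc = with_tags({'user_id': 'u2', 'tags': ['stackoverflow']}, tag_store)
print(new_doc['tags'])
```

The input document is copied rather than mutated, which keeps the helper safe to call from an indexing pipeline that reuses its inputs.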
Python snippet to illustrate the scan and bulk steps:
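    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()
    # myquery, myindex and tags are assumed to be defined elsewhere.

    def actiongen():
        # Scroll over the matching documents, fetching only their IDs.
        docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
        for doc in docs:
            # One partial-update action per matching document.
            yield {
                '_op_type': 'update',
                '_index': doc['_index'],
                '_type': doc['_type'],
                '_id': doc['_id'],
                'doc': {'tags': tags},
            }

    # stats_only=True returns success/failure counts instead of error details.
    helpers.bulk(es, actiongen(), index=myindex, stats_only=True)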
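One caveat with the snippet above: a partial update replaces an array field wholesale, so 'doc': {'tags': tags} overwrites whatever tags a document already had. A rough sketch of one way around this, assuming the scan is changed to return each hit's _source including its tags list (the function name and the sample hits are hypothetical):

```python
def actions_for_missing_tag(hits, tag):
    """Yield update actions only for hits that do not carry `tag` yet,
    sending the merged tag list so existing tags are preserved."""
    for hit in hits:
        existing = hit.get('_source', {}).get('tags') or []
        if tag in existing:
            continue  # already tagged, nothing to send
        yield {
            '_op_type': 'update',
            '_index': hit['_index'],
            '_type': hit['_type'],
            '_id': hit['_id'],
            'doc': {'tags': existing + [tag]},
        }


# Hand-written hits standing in for the output of helpers.scan.
sample_hits = [
    {'_index': 'users', '_type': 'user', '_id': '1',
     '_source': {'tags': ['stackoverflow']}},
    {'_index': 'users', '_type': 'user', '_id': '2',
     '_source': {'tags': ['github']}},
    {'_index': 'users', '_type': 'user', '_id': '3', '_source': {}},
]
merged_actions = list(actions_for_missing_tag(sample_hits, 'github'))
print([(a['_id'], a['doc']['tags']) for a in merged_actions])
```

This is a read-then-write pattern, so a concurrent change to the same document between the scan and the bulk call could be lost; a scripted update avoids that at the cost of running a script per document.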
Using the aforementioned update-by-query plugin, you would simply call:
    curl -XPOST localhost:9200/index/type/_update_by_query -d '{
      "query": {"filtered": {"filter": {
        "not": {"term": {"label": "github"}}
      }}},
      "script": "ctx._source.label = \"github\""
    }'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.