
How to improve parallel_bulk from python code for elastic insert?

I have some documents (about 300 bytes each) that I'd like to insert into my ES index using the Python library. There is a huge time difference between the Python code and curl. I understand that some difference is normal, but I'd like to know whether the Python time can be improved (relative to that ratio).

  1. The curl option takes about 20 sec to insert, plus about 10 sec for printing the ES result (the data is already inserted after the first 20 sec)

    curl -H "Content-Type: application/json" -XPOST \
         "localhost:9200/contentindex/doc/_bulk?" --data-binary @superfile.bulk.json
    
  2. With the Python option, the best time I reached was 1 min 20 sec, using the settings 10000/16/16 (chunk size/thread count/queue size)

    import codecs
    from collections import deque
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk
    
    es = Elasticsearch()
    
    def insert_data(filename, indexname):
        with codecs.open(filename, "r", encoding="utf-8", errors="ignore") as fic:
            for line in fic:        
                json_line = {}
                json_line["data1"] = "random_foo_bar1"
                json_line["data2"] = "random_foo_bar2"
                # more fields ...        
                yield {
                    "_index": indexname,
                    "_type": "doc",
                    "_source": json_line
                }
    
    if __name__ == '__main__':
        pb = parallel_bulk(es, insert_data("superfile.bulk.json", "contentindex"),
                           chunk_size=10000, thread_count=16, queue_size=16)
        # parallel_bulk is a lazy generator: draining it is what sends the requests.
        deque(pb, maxlen=0)
    

Facts

  • The machine has two 8-core Xeon processors and 64 GB of RAM
  • I tried multiple values for each setting: [100-50000]/[2-24]/[2-24] (chunk size/thread count/queue size)
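
The parameter search described above could be automated with a small timing harness. This is only a sketch: `sweep` and `run_once` are hypothetical names, and `run_once` stands for whatever function performs one complete load (for example a wrapper around `parallel_bulk`) with a given chunk size, thread count, and queue size.

```python
import itertools
import time


def sweep(run_once, chunk_sizes, thread_counts, queue_sizes):
    """Time run_once for every parameter combination.

    run_once is a hypothetical callable taking (chunk, threads, queue)
    and performing one full load. Returns a list of
    (elapsed_seconds, chunk, threads, queue) tuples, fastest first.
    """
    timings = []
    for chunk, threads, queue in itertools.product(
        chunk_sizes, thread_counts, queue_sizes
    ):
        start = time.perf_counter()
        run_once(chunk, threads, queue)
        elapsed = time.perf_counter() - start
        timings.append((elapsed, chunk, threads, queue))
    timings.sort()  # tuples sort by elapsed time first
    return timings
```

For the comparison to be fair, each run would presumably need to start from the same index state (for instance, an empty index), which `run_once` would have to take care of.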

Questions

  • Can I still improve the time?

  • If not, should I write the data to a file and then launch a curl process to load it?
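
Taking the second question literally, one possible shape for the "write a file, then call curl" approach is sketched below. The function names (`bulk_lines`, `write_bulk_file`, `curl_command`, `load_with_curl`) are hypothetical; the URL and headers are copied from the curl command above, and the empty `{"index": {}}` action line relies on the index and type being given in the URL.

```python
import json
import subprocess


def bulk_lines(docs):
    """Yield the newline-terminated lines of a _bulk request body."""
    for doc in docs:
        # Action line; index and type come from the request URL.
        yield json.dumps({"index": {}}) + "\n"
        # Source line holding the document itself.
        yield json.dumps(doc) + "\n"


def write_bulk_file(docs, path):
    """Write docs to path in bulk format."""
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(bulk_lines(docs))


def curl_command(path, url="localhost:9200/contentindex/doc/_bulk"):
    """Build the argument list for the curl call (no shell involved)."""
    return [
        "curl",
        "-H", "Content-Type: application/json",
        "-XPOST", url,
        "--data-binary", "@" + path,
    ]


def load_with_curl(docs, path):
    """Write the bulk file, then hand it to curl; returns the CompletedProcess."""
    write_bulk_file(docs, path)
    return subprocess.run(
        curl_command(path), capture_output=True, text=True, check=True
    )
```

Calling `load_with_curl(docs, "superfile.bulk.json")` would need curl on the PATH and a node listening on localhost:9200. Whether this ends up faster than `parallel_bulk` would have to be measured.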


If I time only the parsing part, it takes 15 sec:

    import time

    tm = time.time()
    array = []

    # Parse everything up front so parsing and inserting can be timed separately.
    pb = insert_data("superfile.bulk.json", "contentindex")
    for p in pb:
        array.append(p)
    print(time.time() - tm)            # 15

    pb = parallel_bulk(es, array, chunk_size=10000, thread_count=16, queue_size=16)
    deque(pb, maxlen=0)
    print(time.time() - tm)            # 90 (cumulative, tm is not reset)
azro asked Jun 22 '26 16:06


1 Answer

After my testing:

  1. curl works faster than the Python client; evidently curl is better implemented.

  2. After more testing and playing with the parameters, I can conclude:

    1. Elasticsearch indexing performance depends on the configuration of the index and of the entire cluster. You can get better performance by mapping the fields in the index properly.
    2. My best result was with 8 threads and chunks of 10000 items. This depends on the index.index_concurrency setting, which is 8 by default.
    3. I think that using a multi-node cluster with a separate master node should improve performance.
    4. For more information, you can read a great two-part article I found: here and here
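
As an illustration of the 8 threads / 10000 items setting, here is a minimal sketch. `count_results` and `bulk_insert` are hypothetical helper names; the sketch assumes that `parallel_bulk` yields `(ok, info)` pairs and that `raise_on_error=False` makes it report failed items instead of raising.

```python
def count_results(results):
    """Consume (ok, info) pairs; return (number of successes, list of failures)."""
    n_ok = 0
    failures = []
    for ok, info in results:
        if ok:
            n_ok += 1
        else:
            failures.append(info)
    return n_ok, failures


def bulk_insert(es, actions, chunk_size=10000, thread_count=8):
    """Run parallel_bulk with the settings from the answer and tally the outcome."""
    # Imported here so count_results stays usable without the client installed.
    from elasticsearch.helpers import parallel_bulk

    results = parallel_bulk(
        es,
        actions,
        chunk_size=chunk_size,
        thread_count=thread_count,
        raise_on_error=False,  # yield failed items rather than raising
    )
    return count_results(results)
```

With the question's code, this would be called as `bulk_insert(es, insert_data("superfile.bulk.json", "contentindex"))`. Unlike `deque(pb, maxlen=0)`, it keeps the failed items so they can be inspected afterwards.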

ozlevka answered Jun 25 '26 05:06


