
elasticsearch-py scan and scroll to return all documents

I am using elasticsearch-py to connect to my ES database, which contains over 3 million documents. I want to return all the documents so I can extract data and write it to a CSV file. I was able to do this easily for 10 documents (the default number returned) using the following code.

    es = Elasticsearch("glycerin")
    query = {"query": {"match_all": {}}}
    response = es.search(index="_all", doc_type="patent", body=query)

    for hit in response["hits"]["hits"]:
        print hit

Unfortunately, when I tried to implement scan and scroll to get all the documents, I ran into issues. I tried it two different ways with no success.

Method 1:

    scanResp = es.search(index="_all", doc_type="patent", body=query, search_type="scan", scroll="10m")
    scrollId = scanResp['_scroll_id']

    response = es.scroll(scroll_id=scrollId, scroll="10m")
    print response

Results (screenshot of the error not preserved): in the error message, the URL shows the scroll ID after scroll/ and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))

Method 2:

    query = {"query": {"match_all": {}}}
    scanResp = helpers.scan(client=es, query=query, scroll="10m", index="", doc_type="patent", timeout="10m")

    for resp in scanResp:
        print "Hiya"

If I print out scanResp before the for loop I get <generator object scan at 0x108723dc0>. Because of this I'm relatively certain that I'm messing up my scroll somehow, but I'm not sure where or how to fix it.

Results (screenshot of the error not preserved): again, the URL in the error message shows the scroll ID after scroll/ and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))

I tried increasing the max retries for the transport class, but that didn't make a difference. I would very much appreciate any insight into how to fix this.

Note: My ES is located on a remote desktop on the same network.

asked Apr 07 '14 19:04 by drowningincode


1 Answer

The Python scan method generates a GET call to the REST API and tries to send your scroll_id over HTTP as part of the URL. The most likely cause is that your scroll_id is too large to be sent that way, so the server returns no response and you see this error.

Because the scroll_id grows with the number of shards you have, it is better to use a POST and send the scroll_id in JSON as part of the request body. That way you get around the limit on how long the URL of an HTTP call can be.
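As a rough illustration of the POST approach, here is a minimal Python 3 sketch that uses only the standard library rather than elasticsearch-py. It assumes your Elasticsearch version accepts a JSON body on the `_search/scroll` endpoint, and that the node is reachable at `http://glycerin:9200` (the port is a guess based on the default). The helper names `build_scroll_request` and `fetch_next_page` are made up for this example.

```python
import json
import urllib.request


def build_scroll_request(base_url, scroll_id, scroll="10m"):
    """Build (but do not send) a POST that carries the scroll ID in the JSON body."""
    payload = json.dumps({"scroll": scroll, "scroll_id": scroll_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_search/scroll",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_next_page(base_url, scroll_id, scroll="10m"):
    """Send the scroll request and return the decoded JSON response."""
    request = build_scroll_request(base_url, scroll_id, scroll)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage; the base URL is an assumption about your setup:
# page = fetch_next_page("http://glycerin:9200", scrollId)
# hits = page["hits"]["hits"]
```

Since the scroll ID travels in the request body, the URL stays short no matter how many shards contribute to the ID.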

answered Sep 19 '22 07:09 by chrstahl89