I am using elasticsearch-py to connect to my ES database, which contains over 3 million documents. I want to return all the documents so I can extract data and write it to a CSV file. I was able to accomplish this easily for 10 documents (the default number returned) using the following code.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("glycerin")
    query = {"query": {"match_all": {}}}
    response = es.search(index="_all", doc_type="patent", body=query)
    for hit in response["hits"]["hits"]:
        print hit
Unfortunately, when I attempted to implement the scan & scroll so I could get all the documents I ran into issues. I tried it two different ways with no success.
Method 1:
    scanResp = es.search(index="_all", doc_type="patent", body=query, search_type="scan", scroll="10m")
    scrollId = scanResp['_scroll_id']
    response = es.scroll(scroll_id=scrollId, scroll="10m")
    print response
Results: After scroll/ it gives the scroll id and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))
Method 2:
    from elasticsearch import helpers

    query = {"query": {"match_all": {}}}
    scanResp = helpers.scan(client=es, query=query, scroll="10m", index="", doc_type="patent", timeout="10m")
    for resp in scanResp:
        print "Hiya"
If I print out scanResp before the for loop I get <generator object scan at 0x108723dc0>. Because of this I'm relatively certain that I'm messing up my scroll somehow, but I'm not sure where or how to fix it.
Results: Again, after scroll/ it gives the scroll id and then ends with ?scroll=10m (Caused by <class 'httplib.BadStatusLine'>: ''))
I tried increasing the max retries for the transport class, but that didn't make a difference. I would very much appreciate any insight into how to fix this.
Note: My ES is located on a remote desktop on the same network.
The Python scan helper issues a GET request to the REST API and puts your scroll_id in the URL, as the scroll/...?scroll=10m fragment in your error shows. The most likely cause is that your scroll_id is too long to be sent that way, so the server returns no response and you see the BadStatusLine error.
Because the scroll_id grows with the number of shards you have, it is better to use a POST and send the scroll_id in the JSON body of the request. That way the scroll_id no longer has to fit in the URL.
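As a rough sketch of that idea (Python 3, standard library only), the scroll request could be built by hand so that the scroll_id travels in a JSON body instead of the URL. This assumes the cluster accepts a JSON body with `scroll` and `scroll_id` on the `/_search/scroll` endpoint; older Elasticsearch releases may expect a different body format, so check this against your version. The function names, port 9200, and the placeholder scroll id are assumptions made for illustration, not part of elasticsearch-py.

```python
import json
import urllib.request


def build_scroll_request(base_url, scroll_id, scroll="10m"):
    """Build a POST request that carries the scroll_id in a JSON body.

    base_url is an assumed address such as "http://glycerin:9200".
    """
    body = json.dumps({"scroll": scroll, "scroll_id": scroll_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_search/scroll",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_next_page(base_url, scroll_id, scroll="10m"):
    """Send the POST request and return the decoded JSON response."""
    request = build_scroll_request(base_url, scroll_id, scroll)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage, where scrollId comes from the initial search response:
# page = fetch_next_page("http://glycerin:9200", scrollId)
# for hit in page["hits"]["hits"]:
#     print(hit)
```

However long the scroll_id gets, the URL stays the same short path, so any length limit on the request line no longer applies.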