
Usage of filter_path with helpers.scan in the Elasticsearch client

When doing a search operation in Elasticsearch, I want the metadata to be filtered out so that only "_source" is returned in the response. I'm able to achieve this with "search" in the following way:

out1 = es.search(index='index.com', filter_path=['hits.hits._id', 'hits.hits._source'])

But when I do the same with the scan method, it just returns an empty list:

out2 = helpers.scan(es, query, index='index.com', doc_type='2016-07-27', filter_path=['hits.hits._source'])

The problem may be with the way I'm processing the response of the 'scan' method or with the way I'm passing the value to filter_path. To check the output, I convert out2 to a list.

Jai Sharma asked Dec 09 '16 06:12



2 Answers

The scan helper currently doesn't allow passing extra parameters to the scroll API, so your filter_path doesn't apply to it. It does, however, get applied to the initial search API call that initiates the scan/scroll cycle. This means that the scroll_id is stripped from the response, causing the entire operation to fail.

In your case, even passing the filter_path parameter to the scroll API calls would cause the helper to fail, both because it would strip the scroll_id that this operation needs and because the helper relies on the structure of the response.

My recommendation would be to use source filtering if you need to limit the size of the response, or to use a smaller size parameter than the default of 1000.
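
As a rough sketch of that recommendation, something like the following may work. It puts a "_source" list in the query body instead of using filter_path, and then unwraps each hit on the client side. The function names, the "title" field in the usage note, and the size of 500 are placeholders of mine, not anything from the question; the import sits inside the function only so the two small helpers can be tried without a cluster.

```python
def build_scan_query(fields):
    """Return a match_all body that limits _source to the given fields."""
    return {"query": {"match_all": {}}, "_source": list(fields)}


def only_sources(hits):
    """Yield just the _source dict of each hit (metadata is dropped here)."""
    for hit in hits:
        yield hit["_source"]


def scan_sources(es, index, fields, size=500):
    """Scroll through `index` and yield _source dicts, without filter_path."""
    # Imported lazily so the helpers above work without the client installed.
    from elasticsearch import helpers

    # `size` is the per-request batch size; 500 is an arbitrary smaller value.
    hits = helpers.scan(es, query=build_scan_query(fields), index=index, size=size)
    return only_sources(hits)
```

With a connected client this would be called along the lines of list(scan_sources(es, 'index.com', ['title'])). The metadata still travels over the wire for each hit, so this trims the document bodies rather than the envelope.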

Hope this helps, Honza

Honza Král answered Oct 12 '22 18:10


You could pass filter_path=['_scroll_id', '_shards', 'hits.hits._source'] to the scan helper to get it to work. Obviously that leaves some metadata in the response but it removes as much as possible while allowing the scroll to work. _shards is required because it is used internally by the scan helper.
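
A minimal sketch of how that call might look, assuming the filter_path above behaves as described with your client version (I have not checked it against every release, and later scroll pages may still carry full metadata). The function name and the optional scan argument are my own additions; scan exists only so a stub can be substituted for helpers.scan when trying this without a cluster.

```python
# Keys the scan helper needs, plus the only part of each hit we want.
SCAN_FILTER_PATH = ["_scroll_id", "_shards", "hits.hits._source"]


def scan_sources(es, query, index, scan=None):
    """Yield the _source of every hit, keeping _scroll_id and _shards intact."""
    if scan is None:
        # Imported lazily so a stub can be passed in without the client installed.
        from elasticsearch import helpers
        scan = helpers.scan

    for hit in scan(es, query=query, index=index, filter_path=SCAN_FILTER_PATH):
        yield hit["_source"]
```

Usage would be something like list(scan_sources(es, query, 'index.com')), which gives a plain list of _source dicts.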

Jazz Kersell answered Oct 12 '22 20:10