I'm trying to query Elasticsearch with the elasticsearch-spark connector, and I want to return only a few results:
For example:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // brings esRDD into scope

val conf = new SparkConf().set("es.nodes", "localhost").set("es.index.auto.create", "true").setMaster("local")
val sparkContext = new SparkContext(conf)
val query = "{\"size\":1}"
println(sparkContext.esRDD("index_name/type", query).count())
However, this returns all the documents in the index, not just one.
Some parameters are actually ignored in the query by design, such as from, size, fields, etc. They are used internally by the elasticsearch-spark connector.
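Because size in the query body is silently dropped, one workaround on the RDD API is to cap results on the Spark side instead. A minimal sketch, assuming the sparkContext and imports from the snippet above:

// Sketch: the connector ignores "size" in the query body, so cap
// results client-side with take(n) instead. take only scans as many
// partitions as needed, so it avoids materializing the whole index.
val firstDocs = sparkContext.esRDD("index_name/type").take(10)
firstDocs.foreach(println)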
Unfortunately, this list of unsupported parameters isn't documented. But if you wish to use the size parameter, you can always rely on the pushdown predicate and use the DataFrame/Dataset limit method.
So you ought to use the Spark SQL DSL instead, e.g.:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .load("index_name/doc_type")
  .limit(10) // instead of size: 10
This query will return the first 10 documents returned by the match_all query that the connector uses by default.
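If you need a real query rather than match_all, you can pass one through the es.query setting and keep the limit. A sketch under the same setup; the title field and its match value here are hypothetical:

// Sketch: es.query is a standard elasticsearch-hadoop setting; the
// connector pushes this query down to Elasticsearch, and limit(10)
// caps the rows returned to Spark. "title" is a hypothetical field.
val filtered = sqlContext.read.format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .option("es.query", "{\"query\":{\"match\":{\"title\":\"spark\"}}}")
  .load("index_name/doc_type")
  .limit(10)
filtered.show()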
Note: the following claim isn't correct on any level:

"This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned, so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option)."

When you query Elasticsearch, it runs the query in parallel on all the index shards without overwriting any such parameters.