I have a dedicated index that holds only percolator queries; there are 3,000 of them. Here's a typical query:
{
  "index": "articles_percolators",
  "type": ".percolator",
  "body": {
    "query": {
      "filtered": {
        "query": {
          "bool": {
            "should": [
              {
                "query_string": {
                  "fields": ["title"],
                  "query": "Cities|urban|urbanization",
                  "allow_leading_wildcard": false
                }
              },
              {
                "query_string": {
                  "fields": ["content"],
                  "query": "Cities|urban|urbanization",
                  "allow_leading_wildcard": false
                }
              },
              {
                "query_string": {
                  "fields": ["url"],
                  "query": "Cities|urban|urbanization",
                  "allow_leading_wildcard": false
                }
              }
            ]
          }
        },
        "filter": {
          "bool": {
            "must": [
              {
                "terms": {
                  "feed_id": [3215, 3216, 10674, 26041]
                }
              }
            ]
          }
        }
      }
    },
    "sort": {
      "date": {
        "order": "desc"
      }
    },
    "fields": ["_id"]
  },
  "id": "562"
}
Mapping (PHP array). Filters, analyzers and tokenizers are excluded for brevity:
'index' => 'articles_percolators',
'body' => [
    'settings' => [
        'number_of_shards' => 8,
        'number_of_replicas' => 0,
        'refresh_interval' => -1,
        'analysis' => [
            'filter' => [],
            'analyzer' => [],
            'tokenizer' => []
        ]
    ],
    'mappings' => [
        'article' => [
            '_source' => ['enabled' => false],
            '_all' => ['enabled' => false],
            '_analyzer' => ['path' => 'lang_analyzer'],
            'properties' => [
                'lang_analyzer' => [
                    'type' => 'string',
                    'doc_values' => true,
                    'store' => false,
                    'index' => 'no'
                ],
                'date' => [
                    'type' => 'date',
                    'doc_values' => true
                ],
                'feed_id' => [
                    'type' => 'integer'
                ],
                'feed_subscribers' => [
                    'type' => 'integer'
                ],
                'feed_canonical' => [
                    'type' => 'boolean'
                ],
                'title' => [
                    'type' => 'string',
                    'store' => false
                ],
                'content' => [
                    'type' => 'string',
                    'store' => false
                ],
                'url' => [
                    'type' => 'string',
                    'analyzer' => 'simple',
                    'store' => false
                ]
            ]
        ]
    ]
]
I am then sending documents to the mpercolate API, 100 at a time. Here's part (one document) of the mpercolate request:
{
  "percolate": {
    "index": "articles_percolators",
    "type": "article"
  }
},
{
  "doc": {
    "title": "Win a Bench Full of Test Equipment",
    "url": "\/document.asp",
    "content": "Keysight Technologies is giving away a bench full of general-purpose test equipment.",
    "date": 1421194639401,
    "feed_id": 12240778,
    "feed_subscribers": 52631,
    "feed_canonical": 1,
    "lang_analyzer": "en_analyzer"
  }
}
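The multi-percolate body is newline-delimited JSON: each document contributes an action header line followed by a `doc` line. A minimal sketch of assembling such a batch (the helper name is my own, not part of any client library):

```python
import json

def build_mpercolate_body(docs, index="articles_percolators", doc_type="article"):
    """Build the newline-delimited body for the _mpercolate API.

    Each document contributes two lines: an action header and the doc itself.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"percolate": {"index": index, "type": doc_type}}))
        lines.append(json.dumps({"doc": doc}))
    # The bulk-style APIs expect a trailing newline.
    return "\n".join(lines) + "\n"

body = build_mpercolate_body([
    {"title": "Win a Bench Full of Test Equipment", "feed_id": 12240778}
])
```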
100 articles are processed in ~1 second on my MacBook Pro (2.4 GHz Intel Core i7, 4 cores, 8 with HT), with all cores at maximum:
This seems rather slow to me, but I don't have a baseline to compare against.
I have a regular index with the same mapping (but with 6 shards) holding over 3 billion documents, (still) living on a single server with a 24-core Xeon and 128 GB of RAM. Such queries search across the whole index in under 100 ms (on a hot server).
Is there something obviously wrong in my setup, and has anyone else benchmarked the performance of percolators? I didn't find anything else on the web about this.
My ES version is 1.4.2 with the default configuration, and the workload is completely CPU-bound.
Since John Petrone's comment is right that I should test in a production environment, I ran the test on the same 24-core Xeon we use in production. With the 8-shard percolation index the result is the same, if not worse: times are somewhere between 1 s and 1.2 s, even though network latency there is lower than on my laptop.
This can probably be explained by the Xeon's slower per-core clock speed: 2.0 GHz vs. 2.4 GHz for the i7.
This results in an almost constant CPU utilization of around 800%:
I then recreated the index with 24 shards, and times dropped to 0.8 s per 100 documents, but with more than double the CPU time:
I have a constant flow of around 100 documents per second, and the number of queries will rise in the future, so this is somewhat of a concern for me.
So, just to be clear: you can't compare normal Elasticsearch performance on a 24-core Xeon with 128 GB of memory against ES percolate performance on a laptop. Very different hardware and very different software.
With many large index setups (like yours with 3 billion docs) you tend to be either disk- or memory-bound when running queries. As long as you have enough of both, query performance can be quite high.
Percolation is different: you are in effect indexing each document and then running every query stored in the percolator against it, all against in-memory Lucene indexes:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html
Percolation scales horizontally and tends to be CPU-bound; you scale it by adding additional nodes with sufficient CPU.
With 100 documents submitted via the multi percolate API against 3,000 registered percolate queries, you are basically running 300,000 individual queries. I would expect that to be CPU-bound on a MacBook. I think you'd be better off benchmarking this in a more controlled environment (a separate server), one that you can scale by adding additional nodes.
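The arithmetic behind that estimate, with the numbers from the question:

```python
docs_per_batch = 100
registered_queries = 3000

# Every incoming document is matched against every registered percolator,
# so each batch triggers docs * queries individual query evaluations.
query_executions = docs_per_batch * registered_queries

batch_seconds = 1.0  # observed time per 100-document batch on the MacBook
executions_per_second = query_executions / batch_seconds
```

At ~1 second per batch that is roughly 300,000 query evaluations per second spread over 8 hyperthreads, which is not obviously unreasonable for CPU-bound matching.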
UPDATE
So to get a better idea of what the bottleneck is and how to improve your performance, you're going to need to start with lower numbers of registered queries and lower numbers of documents at a time, and then ratchet up. This will give you a much clearer picture of what's going on behind the scenes.
I'd start with a single document (not 100) and far fewer registered queries, then run a series of tests, some raising the number of documents, some raising the number of registered queries, in multiple steps, eventually going above 100 documents at a time and above 3,000 queries.
By looking at the results you will get a better idea of how performance declines vs. load - linear with number of documents, linear with number of registered queries.
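One way to organize that sweep, with a placeholder cost model standing in for real timed `_mpercolate` calls (the function and constants here are illustrative, not measured):

```python
import itertools

doc_counts = [1, 10, 50, 100, 200]
query_counts = [100, 500, 1000, 3000]

def simulated_seconds(n_docs, n_queries, per_pair=3.3e-6):
    # Placeholder cost model: time proportional to docs x queries.
    # In a real benchmark, replace this with a timed _mpercolate request
    # against an index holding n_queries registered percolators.
    return n_docs * n_queries * per_pair

grid = [(d, q, simulated_seconds(d, q))
        for d, q in itertools.product(doc_counts, query_counts)]

# If percolation scales linearly, seconds / (docs * queries) is ~constant;
# a rising ratio at the high end would reveal an inflection point.
rates = [s / (d * q) for d, q, s in grid]
```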
Other configuration variants I would try: instead of 100 docs via the bulk percolate API, try the single-doc API from multiple threads (to see whether the multi-doc API is the issue). I'd also try running multiple nodes on the same system, or using many smaller servers, to see if you get better performance across multiple smaller nodes. I'd also vary the amount of memory allocated to the JVM (more is not necessarily better).
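A sketch of the threaded single-doc variant. `percolate_one` is a stub here; in a real test it would POST `{"doc": ...}` to `/articles_percolators/article/_percolate` via your HTTP client of choice:

```python
from concurrent.futures import ThreadPoolExecutor

def percolate_one(doc):
    # Stub: in a real benchmark this would send {"doc": doc} to the
    # single-document percolate endpoint and return the parsed response.
    return {"total": 0, "matches": []}

def percolate_batch_threaded(docs, workers=8):
    # Fan 100 single-doc requests out over a thread pool instead of
    # issuing one bulk mpercolate call, to compare the two paths.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(percolate_one, docs))

docs = [{"title": "doc %d" % i} for i in range(100)]
results = percolate_batch_threaded(docs)
```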
Ultimately you want a range of data points to try to identify how your queries scale and where the inflection points are.