 

Question about Solr caching mechanism

Tags: solr

I'm working on an Apache Solr project (distributed in a cloud environment: Amazon EC2 instances).

I've noticed Solr does an excellent job of caching results. When I execute the same queries again, the response reports a Solr QTime of 0 or 1 millisecond.

I want to stress test the Solr system, but I only have a limited list of queries I can use (50 000 unique queries). The problem now is that all queries are cached!

When I stress test, after 5 minutes or so all my queries have been sent to Solr and executed. This makes the system sweat under the heavy load :) (which was the purpose). But then, as I execute the same query set again, QTime is almost zero! Solr has an easy time and isn't stressed.

My question: how can you turn off ALL Solr caches (both Solr and Lucene caches)? Or how can you limit the cache?

I've tried to turn off all of Solr's internal caches, but the cache still stays (QueryResultCache and FieldCache). Note: the config mentions that Lucene will manage an internal cache; maybe this cache is the problem?

It's just weird that all of the 50 000 queries can be stored in the cache - out of the box.

Stijn V asked Nov 24 '10 16:11



2 Answers

You can comment out the filterCache, queryResultCache and documentCache in your configuration (solrconfig.xml). Lucene's FieldCache cannot be disabled.

That said, it doesn't really make any sense to do so, even for benchmarking. Would you also disable disk caching in your operating system? CPU caches (all three levels)? The internal cache of each hard disk?

Caches are part of the system. If you disable them, you won't accurately simulate what happens in production, which renders the benchmark useless.
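For anyone who still wants to try a run without the three Solr caches named above, here is a minimal sketch that strips those elements from a solrconfig.xml string instead of commenting them out by hand. It assumes the cache elements sit directly under the `<query>` section; the `strip_solr_caches` helper and the `SAMPLE` config are made up for illustration and are not part of Solr.

```python
import xml.etree.ElementTree as ET

# The three Solr-level caches mentioned in the answer.
CACHE_TAGS = ("filterCache", "queryResultCache", "documentCache")


def strip_solr_caches(solrconfig_xml: str) -> str:
    """Return the config text with the Solr cache elements removed.

    Assumption: the cache elements are direct children of <query>.
    """
    root = ET.fromstring(solrconfig_xml)
    query = root.find("query")
    if query is not None:
        for tag in CACHE_TAGS:
            # findall returns a list, so removing while looping is safe.
            for element in query.findall(tag):
                query.remove(element)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily shortened solrconfig.xml used only for this example.
SAMPLE = """<config>
  <query>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    <enableLazyFieldLoading>true</enableLazyFieldLoading>
  </query>
</config>"""

stripped = strip_solr_caches(SAMPLE)
print(stripped)
```

Note that parsing this way drops XML comments and the XML declaration from the file, so work on a copy of the config. Lucene's FieldCache is unaffected either way.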

Mauricio Scheffer answered Sep 28 '22 08:09



Turning off caches is an excellent idea, at least those that are application-specific. A benchmark in this case is intended, I gather, to find the response time and cost of a query that has not been seen before, as opposed to queries that are popular within a cache expiry period.

You sound like you want metrics that tell you how the search system performs, not the query cache.

Previous answers are really out of left field, suggesting that all benchmarks should measure the same thing: his own definition of "real life" performance. That is not how engineering works.

As to the remark about "disk caches": there are no disk caches in Linux, only a page cache. Whether a page is persisted on disk, created and destroyed in memory, or pre-allocated by a smart file system for large files, they are all pages.

There is a benefit to benchmarking with caches... if you bother to measure the cache performance metrics.
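As a rough illustration of measuring the cache rather than guessing, the sketch below computes the hit ratio for just the benchmark window from two snapshots of a cache's counters. The `lookups` and `hits` keys mirror the counter names Solr reports per cache; how you collect the snapshots is left out, and the numbers are invented for the example.

```python
from typing import Mapping


def hit_ratio_during_run(before: Mapping[str, int], after: Mapping[str, int]) -> float:
    """Hit ratio for the interval between two cumulative counter snapshots."""
    lookups = after["lookups"] - before["lookups"]
    hits = after["hits"] - before["hits"]
    if lookups <= 0:
        # Nothing was looked up during the run, so there is no ratio to report.
        return 0.0
    return hits / lookups


# Invented snapshots of queryResultCache counters, taken before and after a run.
before = {"lookups": 1000, "hits": 400}
after = {"lookups": 51000, "hits": 45400}

ratio = hit_ratio_during_run(before, after)
print(f"hit ratio during run: {ratio:.2f}")
```

Using the difference between snapshots keeps earlier traffic out of the figure, since the counters are cumulative.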

BTW, between "-server" and "-XX:CompileThreshold" you want to make sure your first large set of queries is either random enough or specifically chosen to exercise as many code paths in Solr/Lucene as you can, so the JIT is both active and somewhat settled.
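One way to sketch that warm-up idea: shuffle the query list, run a first slice without recording anything so the JIT can settle, then time the rest. The `benchmark` function, its `warmup_fraction` parameter and the `run_query` callable are hypothetical; in a real test `run_query` would send the query to Solr.

```python
import random
import time
from typing import Callable, List, Sequence


def benchmark(
    queries: Sequence[str],
    run_query: Callable[[str], object],
    warmup_fraction: float = 0.2,
    seed: int = 42,
) -> List[float]:
    """Run a shuffled warm-up slice untimed, then time the remaining queries."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)  # random order, reproducible via seed

    split = int(len(shuffled) * warmup_fraction)
    warmup, measured = shuffled[:split], shuffled[split:]

    for query in warmup:
        run_query(query)  # results and timings are deliberately discarded

    timings = []
    for query in measured:
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in for a real Solr request: just record which queries were "sent".
calls: List[str] = []
timings = benchmark([f"q{i}" for i in range(10)], calls.append)
print(len(calls), "queries run,", len(timings), "timed")
```

Because warm-up and measured queries come from the same shuffled list, no measured query has been seen during warm-up, so the timed part is not served from query caches filled by the warm-up.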

billy answered Sep 28 '22 09:09
