Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SOLR performance tuning

Tags:

java

solr

lucene

I've read the following:

http://wiki.apache.org/solr/SolrPerformanceFactors

http://wiki.apache.org/solr/SolrCaching

http://www.lucidimagination.com/content/scaling-lucene-and-solr

And I have questions about a few things:

  1. If I use the JVM option -XX:+UseCompressedStrings what kind of memory savings can I achieve? To keep a simple example, if I have 1 indexed field (string) and 1 stored field (string) with omitNorms=true and omitTf=true, what kind of savings in the index and document cache can I expect? I'm guessing about 50%, but maybe that's too optimistic.
  2. When exactly is the Solr filter cache doing? If I'm just doing a simple query with AND and a few ORs, and sorting by score, do I even need it?
  3. If I want to cache all documents in the document cache, how would I compute the space required? Using the example from above, if I have 20M documents, use compressed strings, and the average length of the stored field is 25 characters, is the space required basically (25 bytes + small_admin_overhead) * 20M?
  4. if all documents are in the document cache, how important is the query cache?
  5. If I want to autowarm every document into the doc cache, will autowarm query of *:* do it?
  6. The scaling-lucene-and-solr article says FuzzyQuery is slow. If I'm using the spellcheck feature of solr then I'm basically using fuzzy query right (because spellcheck does the same edit distance calculation)? So presumably spellcheck and fuzzy query are both equally "slow"?
  7. The section describing the lucene field cache for strings is a bit confusing. Am I reading it correctly that the space required is basically the size of the indexed string field + an integer arry equal to the number of unique terms in that field?
  8. Finally, under maximizing throughput, there is a statement about leaving enough space for the OS disk cache. It says, "All in all, for a large scale index, it's best to be sure you have at least a few gigabytes of RAM beyond what you are giving to the JVM.". So if I have a 12GB memory machine (as an example), I should give at least 2-3GB to the OS? Can I estimate the disk cache space needed by the OS by looking at the on disk index size?
like image 848
Kevin Avatar asked Dec 25 '11 00:12

Kevin


2 Answers

  1. Only way to be sure is to try it out. However, I would expect very little savings in the Index, as the index would only contain the actual string once each time, the rest is data for locations of that string within documents. They aren't a large part of the index.
  2. Filter cache only caches filter queries. It may not be useful for your precise use case, but many do find them useful. For example, narrowing results by country, language, product type, etc. Solr can avoid recalculating the query results for things like this if you use them frequently.
  3. Realistically, you just have to try it and measure it with a profiler. Without in depth knowledge of EXACTLY the data structure used, anything else is pure SWAG. Your calculation is just as good as anyone else's without profiling.
  4. Document cache only saves time in constituting the results AFTER the query has been calculated. If you spend most of your time calculating queries, the document cache will do you little good. Query cache is only useful for re-used queries. If none of your queries are repeated, then Query cache is useless
  5. yes, assuming your Document cache is large enough to hold them all.

6-8 Not positive.

From my own experience with Solr performance tuning, you should leave Solr to deal with queries, not document storage. The majority of your questions focus on how documents take up space. Solr is a search engine, not a document storage repository. If you want Solr to be FAST and take up minimal memory, then the only thing it should hold onto is index information for searching purposes. The documents themselves should be stored, retrieved, and rendered elsewhere. Preferably in system that is optimized specifically for that job. The only field you should store in your Solr document is an ID for retrieval from the document storage system.

like image 97
rfeak Avatar answered Nov 02 '22 17:11

rfeak


Caches

In general, caching looks like a good idea to improve performance, but this also has a lot of issues:

  • cached objects are likely to go into the old generation of the garbage collector, which is more costly to collect,
  • managing insertions and evictions adds some overhead.

Moreover, caching is unlikely to improve your search latency much unless there are patterns in your queries. On the contrary, if 20% of your traffic is due to a few queries, then the query results cache may be interesting. Configuring caches requires you to know your queries and your documents very well. If you don't, you should probably disable caching.

Even if you disable all caches, performance could still be pretty good thanks to the OS I/O cache. Practically, this means that if you read the same portion of a file again and again, it is likely that it will be read from disk only the first time, and then from the I/O cache. And disabling all caches allows you to give less memory to the JVM, so that there will be more memory for the I/O cache. If your system has 12GB of memory and if you give 2GB to the JVM, this means that the I/O cache might be able to cache up to 10G of your index (depending on other applications running which require memory too).

I recommand you read this to get more information on application-level cache vs. I/O cache:

https://www.varnish-cache.org/trac/wiki/ArchitectNotes

http://antirez.com/post/what-is-wrong-with-2006-programming.html

Field cache

The size of the field cache for a string is (one array of integers of length maxDoc) + (one array for all unique string instances). So if you have an index with one string field which has N instances of size S on average, and if your index has M documents, then the size of the field cache for this field will be approximately M * 4 + N * S.

The field cache is mainly used for facets and sorting. Even very short strings (less than 10 chars) are more than 40 bytes, this means that you should expect Solr to require a lot of memory if you sort or facet on a String field which has a high number of unique values.

Fuzzy Query

FuzzyQuery is slow in Lucene 3.x, but much faster in Lucene 4.x.

It depends on the Spellchecker implementation you choose but I think that the Solr 3.x spell checker uses N-Grams to find candidates (this is why it needs a dedicated index) and then only computes distances on this set on candidates, so the performance is still reasonably good.

like image 39
jpountz Avatar answered Nov 02 '22 17:11

jpountz