I've read the following:
http://wiki.apache.org/solr/SolrPerformanceFactors
http://wiki.apache.org/solr/SolrCaching
http://www.lucidimagination.com/content/scaling-lucene-and-solr
And I have questions about a few things:
-XX:+UseCompressedStrings
what kind of memory savings can I achieve? To keep the example simple, if I have one indexed field (string) and one stored field (string) with omitNorms=true and omitTf=true, what kind of savings in the index and document cache can I expect? I'm guessing about 50%, but maybe that's too optimistic.
From my own experience with Solr performance tuning, you should let Solr deal with queries, not document storage. Most of your questions focus on how much space documents take up, but Solr is a search engine, not a document storage repository. If you want Solr to be fast and use minimal memory, the only thing it should hold onto is the index information needed for searching. The documents themselves should be stored, retrieved, and rendered elsewhere, preferably in a system that is optimized specifically for that job. The only field you should store in your Solr document is an ID used to retrieve the document from the storage system.
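As a rough illustration of that approach (the field names here are just placeholders, and I'm assuming the "string" and "text" field types from the stock example schema.xml), the schema boils down to an ID that is stored and searchable text that is indexed but not stored:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="body" type="text" indexed="true" stored="false"/>

At query time you ask Solr for the ID only (fl=id) and fetch the full documents from your storage system using those IDs.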
Caches
In general, caching looks like a good idea for improving performance, but it also has a number of issues.
Caching is unlikely to improve your search latency much unless there are patterns in your queries. On the other hand, if 20% of your traffic comes from just a few distinct queries, then the query result cache may be interesting. Configuring caches requires you to know your queries and your documents very well; if you don't, you should probably disable caching.
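For reference, the caches are configured in solrconfig.xml; a minimal sketch (the sizes here are arbitrary placeholders) looks roughly like this, and removing or commenting out an entry disables that cache entirely:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>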
Even if you disable all Solr caches, performance can still be pretty good thanks to the OS I/O cache. In practice, this means that if you read the same portion of a file again and again, it will likely be read from disk only the first time and from the I/O cache afterwards. Disabling all Solr caches also lets you give less memory to the JVM, leaving more memory for the I/O cache. If your system has 12GB of memory and you give 2GB to the JVM, the I/O cache might be able to cache up to 10GB of your index (depending on the memory needs of other applications running on the box).
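As a concrete example of that split (hypothetical numbers, and assuming the example Jetty distribution that ships with Solr), you cap the heap at startup and leave the rest of the RAM to the OS:

java -Xms2g -Xmx2g -jar start.jar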
I recommend reading these to get more information on application-level caches vs. the I/O cache:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
http://antirez.com/post/what-is-wrong-with-2006-programming.html
Field cache
The size of the field cache for a string field is (one array of integers of length maxDoc) + (one array holding all the unique string instances). So if your index has M documents and the field has N unique values whose average in-memory size is S bytes, the field cache for this field will take approximately M * 4 + N * S bytes.
The field cache is mainly used for faceting and sorting. Since even very short strings (fewer than 10 characters) take more than 40 bytes in the JVM, you should expect Solr to require a lot of memory if you sort or facet on a string field with a high number of unique values.
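To make the formula concrete with made-up numbers: say M = 10 million documents and N = 1 million unique values whose average in-memory size S is about 60 bytes (a short string plus the roughly 40 bytes of JVM overhead mentioned above). Then:

10,000,000 * 4 + 1,000,000 * 60 = 40 MB + 60 MB ≈ 100 MB

and that is for a single sorted or faceted string field.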
Fuzzy Query
FuzzyQuery is slow in Lucene 3.x, but much faster in Lucene 4.x.
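If I remember correctly, the query syntax reflects that change as well: in 3.x the number after the tilde is a similarity between 0 and 1, while in 4.x it is a maximum edit distance (the field name here is just an example):

3.x: q=title:solr~0.7
4.x: q=title:solr~2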
It depends on the spellchecker implementation you choose, but I think the Solr 3.x spell checker uses n-grams to find candidates (which is why it needs a dedicated index) and then only computes distances on this set of candidates, so performance is still reasonably good.
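For reference, this is roughly what that setup looks like in solrconfig.xml for Solr 3.x (the field name and index directory are placeholders); the IndexBasedSpellChecker is the one that builds the dedicated n-gram index:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>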