 

Solr caching with EHCache/BigMemory

We are implementing a large Lucene/Solr setup with documents in excess of 150 million. We will also have a moderate number of document updates every day.

My question is really a two-part one:

What are the implications of using another caching implementation within Solr, i.e. EHCache instead of the native Solr LRUCache/FastLRUCache?

Terracotta has announced BigMemory that is meant to be used in conjunction with EHCache as an in-process off-heap cache. According to TC, this allows you to store large amounts of data without the GC overhead of the JVM. Is this a good idea to use with Solr? Will it actually help?

I would especially like to hear from people with real production experience with EHCache/BigMemory and/or Solr cache tuning.

nvalada asked Feb 03 '11

1 Answer

I have lots of thoughts on this topic, though my response doesn't leverage EhCache in any way.

First, I don't believe documents should be stored in your search index. Search content should be stored there, not the entire document. What I mean is that your search query should return document IDs, not the contents of the documents themselves. The documents themselves should be stored in and retrieved from a second system, probably the original file store they are indexed from to begin with. This will reduce index size, decrease your document cache size, decrease master-slave replication time (this can become a bottleneck if you update often), and decrease the overhead in writing search responses.
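A minimal SolrJ sketch of that pattern, assuming an `id` field and some external document store; the collection name, URL, and `DocumentStore` interface here are illustrative, not part of the original answer:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class IdOnlySearch {
    // Hypothetical external document store (filesystem, database, etc.).
    interface DocumentStore {
        String fetch(String id);
    }

    public static List<String> search(SolrClient solr, DocumentStore store, String userQuery)
            throws Exception {
        SolrQuery query = new SolrQuery(userQuery);
        query.setFields("id");   // return only the ID field, not stored document bodies
        query.setRows(20);

        QueryResponse response = solr.query(query);

        List<String> documents = new ArrayList<>();
        for (SolrDocument match : response.getResults()) {
            String id = (String) match.getFieldValue("id");
            documents.add(store.fetch(id)); // full content comes from the external store
        }
        return documents;
    }

    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and collection name, purely for illustration.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            DocumentStore store = id -> "...load document " + id + " from primary storage...";
            search(solr, store, "title:lucene").forEach(System.out::println);
        }
    }
}
```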

Next, consider putting a reverse HTTP proxy in front of Solr. Although the query caches allow Solr to respond quickly, a cache like Varnish sitting in front of Solr is even faster. This offloads work from Solr, allowing it to spend its time responding to queries it hasn't seen before. The second effect is that you can now throw most of your memory at document caches instead of query caches. If you followed my first suggestion, your documents will be incredibly small, allowing you to keep most, if not all, of them in memory.
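One practical detail, not spelled out in the answer, is that a reverse proxy can only cache what it can key on, so logically identical queries need to produce byte-for-byte identical URLs. A rough sketch of canonicalizing request URLs on the client side (the parameter names and base URL are just assumptions for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.TreeMap;

public class CanonicalSolrUrl {
    /**
     * Builds a Solr select URL with parameters sorted by name so the same
     * logical query always yields the same URL, maximizing proxy cache hits.
     */
    public static String build(String solrBase, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(solrBase).append("/select?");
        boolean first = true;
        for (Map.Entry<String, String> e : new TreeMap<>(params).entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(URLEncoder.encode(e.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> params = Map.of("q", "title:lucene", "fl", "id", "rows", "20");
        // The same parameters in any insertion order produce the same URL.
        System.out.println(build("http://localhost:8983/solr/docs", params));
    }
}
```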

A quick back-of-the-envelope calculation for document sizes: a 32-bit int is easily enough to provide an ID for 150 million documents, and it still leaves roughly 10x headroom for document growth. 150 million IDs take up 600MB. Add in a fudge factor for Solr wrapping the documents, and you can probably have all your Solr documents cached in 1-2GB. Considering that getting 12GB-24GB of RAM is easy nowadays, I'd say you could do this all on one box and get incredible performance. No need for anything extraneous like EhCache; just make sure you use your search index as efficiently as possible.
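Spelling that arithmetic out; the 2-3x overhead multiplier is an assumption chosen only to land in the answer's 1-2GB range:

```java
public class CacheSizeEstimate {
    public static void main(String[] args) {
        long documents = 150_000_000L;  // current corpus size
        long bytesPerId = 4L;           // a 32-bit int ID (max ~2.1 billion, so 10x+ headroom)
        long rawIdBytes = documents * bytesPerId;

        // 150M * 4 bytes = 600,000,000 bytes, roughly 600 MB of raw IDs.
        System.out.printf("raw IDs: %,d bytes (~%d MB)%n", rawIdBytes, rawIdBytes / 1_000_000);

        // Rough fudge factor for Solr's per-document wrapping overhead; the exact
        // multiplier is an assumption, the answer just estimates 1-2 GB overall.
        for (int fudge = 2; fudge <= 3; fudge++) {
            long cached = rawIdBytes * fudge;
            System.out.printf("with %dx overhead: ~%.1f GB%n", fudge, cached / 1_000_000_000.0);
        }
    }
}
```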

Regarding GC: I didn't see a lot of GC time spent on my Solr servers. Most of what needed to be collected were the very short-lived objects involved in the HTTP request and response cycle, which never get out of eden space. The caches didn't have high turnover when tuned correctly. The only large changes came when a new index was loaded and the caches were flushed, but that wasn't happening constantly.

EDIT: For background, I spent some considerable time tuning Solr caching for a large company that sells consoles and serves millions of searches per day from their Solr servers.

rfeak answered Sep 18 '22