
Solr loads entire index into memory

I am using Solr for records of the form name:age:sex:balance:nextbalance:interest.

I have 30 million records totaling 4 GB on disk. I am querying on age:23, which matches only about 50 records. The field has indexed="true" in the schema XML. Solr seems to load the entire on-disk index (4 GB) into memory. Isn't it supposed to load only the matching records into memory?

Hari asked Mar 14 '12 14:03


3 Answers

This may be the document cache, whose size you need to specify. Can you please check the following in solrconfig.xml?

<!-- documentCache caches Lucene Document objects (the stored fields for each document).
  -->
<documentCache
  class="solr.LRUCache"
  size="16384"
  initialSize="16384"/>
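
To illustrate what the size attribute bounds: an LRU cache holds at most that many entries (here, cached documents, not bytes) and evicts the least recently used entry when a new one would exceed the limit. The sketch below is a toy Python model of that policy, not Solr's implementation; the class and variable names are made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy size-bounded cache: evicts the least recently used entry when full."""

    def __init__(self, size):
        self.size = size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.size:
            self._entries.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._entries)


cache = LRUCache(size=2)
cache.put(1, "doc-1")
cache.put(2, "doc-2")
cache.get(1)           # doc 1 is now the most recently used
cache.put(3, "doc-3")  # the cache is full, so doc 2 is evicted
```

With size=2 the cache never holds more than two documents, so memory used by the cache scales with the configured size and with how large each cached document is.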
stzoannos answered Oct 01 '22 00:10


I think it depends on how you configure the cache (what it does and doesn't keep in memory). Loading the entire index into memory can give you huge performance boosts in terms of the time needed to retrieve results, regardless of the query.

Details on configuring cache, and details on performance factors:

  • https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceFactors
jefflunt answered Oct 01 '22 00:10


Fields that are stored but not indexed are saved on disk but not in RAM. However, 100% of the records are indeed indexed in RAM, and those indexes contain all of the indexed fields. Inverted indexes are rather efficient for that, though.
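
As a rough picture of why an inverted index makes a lookup like age:23 cheap, here is a toy Python sketch. It is only a conceptual model, not how Lucene lays out its index, and the sample records and function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample records, loosely following the name:age:sex:balance layout.
records = [
    {"name": "alice", "age": 23, "sex": "F", "balance": 100},
    {"name": "bob", "age": 41, "sex": "M", "balance": 250},
    {"name": "carol", "age": 23, "sex": "F", "balance": 75},
]


def build_inverted_index(docs, field):
    """Map each value of `field` to the list of ids of the docs that contain it."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        index[doc[field]].append(doc_id)
    return index


age_index = build_inverted_index(records, "age")

# The lookup reads one posting list instead of scanning every record.
matching_ids = age_index.get(23, [])
matching = [records[i] for i in matching_ids]
```

The index covers every record, but answering the query only reads the posting list for the value 23 and then fetches the matching documents.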

However, when you run a query, Solr does retrieve the entire set of stored (but not indexed) field contents into RAM for the records that match. This is usually considered desirable caching behavior, because search results can be transmitted sooner, which reduces overall query turnaround time. As usual with Solr, you can configure caching behavior in many ways to match your RAM budget and database needs. Have a look at the possibilities in solrconfig.xml.
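
On the query side, one option is to request only the fields and rows you need through the standard q, fl and rows parameters. The Python sketch below only builds the request URL. The host, the core name "records" and the field list are assumptions for illustration. fl restricts which fields come back in the response; whether it also reduces what Solr reads into memory depends on your configuration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, fields, rows):
    """Build a Solr select URL that returns only the given fields."""
    params = {
        "q": query,              # e.g. age:23
        "fl": ",".join(fields),  # fields to return in the response
        "rows": rows,            # maximum number of documents to return
        "wt": "json",            # response format
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host and core name; adjust to your own deployment.
url = build_select_url(
    "http://localhost:8983/solr", "records", "age:23", ["name", "balance"], 50
)
```

The resulting URL can be fetched with any HTTP client.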

Note that this is a complex area, and you will probably find it difficult to fully understand caching if Google is your main source of information. It is better to learn this from one of the books on Solr.

Michael Dillon answered Oct 01 '22 01:10