I am using Solr for data with fields like name:age:sex:balance:nextbalance:interest.
I have 30 M records totaling 4 GB on disk. I am retrieving by age:23, which matches only 50 records. I have indexed="true" in the schema XML. Solr seems to load the entire on-disk index into memory (4 GB). Isn't it supposed to load only the 40-odd matching records into memory?
Apache Solr stores the data it indexes in the local filesystem by default. HDFS (Hadoop Distributed File System) provides several benefits, such as a large scale and distributed storage with redundancy and failover capabilities. Apache Solr supports storing data in HDFS.
After you post all your documents, call commit once, either manually or from SolrJ. It will take a while to commit, but this will be much faster overall. Also, after you are done with your bulk import, reduce maxTime and maxDocs so that any incremental posts you make to Solr get committed much sooner.
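For illustration, the maxTime and maxDocs settings mentioned above live in the autoCommit block of solrconfig.xml. This is a minimal sketch; the numbers are placeholder assumptions, not recommendations.

```xml
<!-- solrconfig.xml: automatic hard-commit thresholds. -->
<!-- The values below are illustrative assumptions; tune them for your workload. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit once this many documents are pending -->
    <maxDocs>10000</maxDocs>
    <!-- or once the oldest pending document is this old, in milliseconds -->
    <maxTime>15000</maxTime>
  </autoCommit>
</updateHandler>
```

During a bulk import you would keep these thresholds high (or rely on the single explicit commit at the end) and lower them again afterwards.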
Full index takes about 40 hours using DB. There are some factors that might be slowing you down: Memory.
Solr works by gathering, storing and indexing documents from different sources and making them searchable in near real-time. It follows a 3-step process that involves indexing, querying, and finally, ranking the results – all in near real-time, even though it can work with huge volumes of data.
Maybe this is the document cache. You need to specify its size. Can you please check the following in solrconfig.xml?
<!-- documentCache caches Lucene Document objects (the stored fields for each document). -->
<documentCache
    class="solr.LRUCache"
    size="16384"
    initialSize="16384"/>
I think it depends on how you configure the cache (what it does and doesn't keep in memory). Loading the entire index into memory can give you huge performance boosts in terms of the time needed to retrieve results, regardless of the query.
Details on configuring cache, and details on performance factors:
Fields that are stored but not indexed are saved on disk but not in RAM. However, 100% of the records are indeed indexed in RAM, and those indexes contain all of the indexed fields. Inverted indexes are rather efficient at that, though.
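To make the indexed/stored distinction concrete, here is a hedged sketch of how the question's fields might be declared in schema.xml. The field names come from the question; the type names (string, int, float) and the choice of which fields to index are assumptions and depend on the fieldType definitions and query needs of your own schema.

```xml
<!-- schema.xml sketch; field types and indexed flags are assumed, not taken from the question. -->
<!-- indexed="true": the field is searchable through the inverted index. -->
<!-- stored="true": the original value is kept so it can be returned in results. -->
<field name="name"        type="string" indexed="false" stored="true"/>
<field name="age"         type="int"    indexed="true"  stored="true"/>
<field name="sex"         type="string" indexed="true"  stored="true"/>
<field name="balance"     type="float"  indexed="false" stored="true"/>
<field name="nextbalance" type="float"  indexed="false" stored="true"/>
<field name="interest"    type="float"  indexed="false" stored="true"/>
```

With a layout like this, only age and sex contribute to the inverted index; the other fields are read from stored data only for the documents a query actually returns.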
However, when you run queries, SOLR does retrieve the entire set of stored (but not indexed) field contents into RAM for the records that match. This is usually considered desirable caching behavior, because search results can be transmitted sooner, which reduces the overall query turnaround time. As usual with SOLR, you can configure caching behavior in many ways to match your RAM budget and database needs. Have a look at the possibilities in solrconfig.xml.
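As a rough illustration of the knobs involved, the caches sit in the query section of solrconfig.xml. The sizes below are placeholder assumptions chosen to show a smaller memory budget, not tuned values. The cache class follows the solr.LRUCache used earlier in this thread; newer Solr releases use a different cache class, so check the sample solrconfig.xml shipped with your version.

```xml
<!-- solrconfig.xml sketch; all sizes are illustrative assumptions. -->
<query>
  <!-- caches sets of matching document ids for filter queries (fq) -->
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <!-- caches ordered lists of document ids for recent queries -->
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <!-- caches the stored fields of recently returned documents -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
  <!-- defer loading stored fields that a request did not ask for -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

Each size is a maximum number of cache entries, not a number of bytes, so the real memory cost depends on how large your documents and result sets are.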
Note that this is a complex area and you probably will find it difficult to fully understand caching if Google is your main info source. This is an area where it is better to learn from one of the books on SOLR.