I have a 3-node SolrCloud setup (replication factor 3) running Solr 6.0 on Ubuntu 14.04 with SSDs. A lot of indexing is taking place, using only softCommits. After some time, indexing becomes really slow, but when I restart the Solr service on the node that became slow, everything goes back to normal. The problem is that I have to guess which node has become slow.
I have 5 collections, but only one collection (the most heavily used) gets slow. Total data size is 144G including tlogs. The affected core/collection is 99G including tlogs; the tlog itself is just 313M. Heap size is 16G, total memory is 32G, and the data is stored on SSD. Every node is configured the same.
What seems strange is that I see literally hundreds or thousands of log lines per second on both slaves when this hits:
2016-09-16 10:00:30.476 INFO (qtp1190524793-46733) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[ka2PZAqO_ (1545622027473256450)]} 0 0
2016-09-16 10:00:30.477 INFO (qtp1190524793-46767) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[nlFpoYNt_ (1545622027474305024)]} 0 0
2016-09-16 10:00:30.477 INFO (qtp1190524793-46766) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[tclMjXH6_ (1545622027474305025), 98OPJ3EJ_ (1545622027476402176)]} 0 0
2016-09-16 10:00:30.478 INFO (qtp1190524793-46668) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[btceXK4M_ (1545622027475353600)]} 0 0
2016-09-16 10:00:30.479 INFO (qtp1190524793-46799) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[3ndK3HzB_ (1545622027476402177), riCqrwPE_ (1545622027477450753)]} 0 1
2016-09-16 10:00:30.479 INFO (qtp1190524793-46820) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [mycollection_shard1_replica1] webapp=/solr path=/update params={update.distrib=FROMLEADER&update.chain=add-unknown-fields-to-the-schema&distrib.from=http://192.168.0.3:8983/solr/mycollection_shard1_replica3/&wt=javabin&version=2}{add=[wr5k3mfk_ (1545622027477450752)]} 0 0
In this case 192.168.0.3 is the master.
My workflow is that I insert batches of 2500 docs with ~10 threads in parallel, which works perfectly fine most of the time, but sometimes it becomes slow as described. Occasionally there are updates/indexing calls from other sources, but they account for less than a percent of the traffic.
UPDATE
The complete config (output from the Config API) is here: http://pastebin.com/GtUdGPLG
UPDATE 2
These are the command line args:
-DSTOP.KEY=solrrocks
-DSTOP.PORT=7983
-Dhost=192.168.0.1
-Djetty.home=/opt/solr/server
-Djetty.port=8983
-Dlog4j.configuration=file:/var/solr/log4j.properties
-Dsolr.install.dir=/opt/solr
-Dsolr.solr.home=/var/solr/data
-Duser.timezone=UTC
-DzkClientTimeout=15000
-DzkHost=192.168.0.1:2181,192.168.0.2:2181,192.168.0.3:2181
-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:NewRatio=3
-XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=4
-XX:TargetSurvivorRatio=90
-Xloggc:/var/solr/logs/solr_gc.log
-Xms16G
-Xmx16G
-Xss256k
-verbose:gc
UPDATE 3
Happened again. These are some Sematext graphs (screenshots omitted):
Sematext Dashboard for Master
Sematext Dashboard for Secondary 1
Sematext Dashboard for Secondary 2
Sematext GC for Master
Sematext GC for Secondary 1
Sematext GC for Secondary 2
UPDATE 4 (2018-01-10)
This is a quite old question, but I recently discovered that someone had installed a cryptocoin miner on all of my Solr machines via CVE-2017-12629, which I fixed by upgrading to 6.6.2.
If you're not sure whether your system is compromised, check the processes running as the solr user with ps aux | grep solr. If you see two or more processes, especially a non-Java process, you might be running a miner.
As an example, suppose the index data on a server is 8GB. If your OS, Solr's Java heap, and all other running programs require 4GB of memory, then an ideal memory size for that server is at least 12GB. You might be able to make it work with 8GB total memory (leaving 4GB for disk cache), but that also might NOT be enough.
So you're seeing disk I/O hitting 100% during indexing, with a high write-throughput application.
There are two major drivers of disk I/O with Solr indexing:
Segment flushes: if your indexer isn't directly calling commit as part of the indexing process (and you should make sure it isn't), Solr will flush index segments to disk based on your current settings: whenever the RAM buffer fills up ("ramBufferSizeMB":100.0), or whenever the auto hard commit interval elapses ("maxTime":180000).
Segment merges: if your indexer isn't directly calling optimize as part of the indexing process (and you should make sure it isn't), Solr will periodically merge index segments on disk based on your current settings (the default merge policy): mergeFactor: 10, or roughly each time the number of on-disk index segments exceeds 10.
Based on the way you've described your indexing process:
2500 doc batches per thread x 10 parallel threads
... you could probably get away with a larger RAM buffer, to yield larger initial index segments (that are then flushed to disk less frequently).
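For orientation, here is a minimal solrconfig.xml sketch showing where those two flush-related settings live. The element names are standard Solr config; the 256MB value is just an assumption to illustrate a larger buffer, not a recommendation:
<!-- updateHandler: the auto hard commit that flushes segments and truncates the tlog -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>180000</maxTime>            <!-- your current value: 3 minutes -->
    <openSearcher>false</openSearcher>   <!-- typical when visibility comes from soft commits -->
  </autoCommit>
</updateHandler>
<!-- indexConfig: the in-memory buffer whose size determines how often segments are flushed -->
<indexConfig>
  <ramBufferSizeMB>256</ramBufferSizeMB> <!-- assumption: raised from your current 100 -->
</indexConfig>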
However, the fact that your indexing process "works perfectly fine for most of the time but sometimes it becomes slow" makes me wonder if you're just seeing the effect of a large merge happening in the background, cannibalizing the system resources needed for fast indexing at that moment.
Ideas
You could experiment with a larger mergeFactor (e.g. 25). This will reduce the frequency of background index segment merges, but not the resource drain when they happen. (Also, be aware that more index segments often translates to worse query performance.)
In the indexConfig, you can try overriding the default settings for the ConcurrentMergeScheduler to throttle the number of merges that can be running at one time (maxMergeCount), and/or throttle the number of threads that can be used for merges (maxThreadCount), based on the system resources you're willing to make available (a combined indexConfig sketch follows these ideas).
You could increase your ramBufferSizeMB. This will reduce the frequency of in-memory index segments being flushed to disk, also serving to slow down the merge cadence.
If you are not relying on Solr for durability, you'll want /var/solr/data pointing to a local SSD volume. If you're going over a network mount (this has been documented with Amazon's EBS), there is a significant write throughput penalty, up to 10x less than writing to ephemeral/local storage.
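Putting the config-related ideas together, here is a hedged indexConfig sketch for solrconfig.xml. The factory and scheduler class names are the standard Solr 6 ones; every numeric value below is an assumption to experiment with, not a recommendation:
<indexConfig>
  <!-- larger RAM buffer: fewer, larger initial segments (your current value is 100) -->
  <ramBufferSizeMB>256</ramBufferSizeMB>

  <!-- larger "mergeFactor": in Solr 6 the legacy mergeFactor maps to TieredMergePolicy's
       maxMergeAtOnce/segmentsPerTier; raising them from the default 10 to ~25 merges less often -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">25</int>
    <int name="segmentsPerTier">25</int>
  </mergePolicyFactory>

  <!-- merge scheduler throttling: cap how many merges may run at once and how many threads
       they may use, so a big background merge can't starve indexing -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">1</int>
  </mergeScheduler>
</indexConfig>
After changing solrconfig.xml you'd reload the collection (or restart the nodes) for the new indexConfig to take effect.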