Solr indexing - Master/Slave replication, how to handle huge index and high traffic?

I'm currently facing an issue with Solr (more precisely with slave replication), and after spending quite a bit of time reading online I find myself having to ask for some enlightenment.

- Does Solr have any size limit for its index?

When dealing with a single master, at what point is it right to decide to use multiple cores or multiple indexes? Is there any indication that partitioning is recommended once the index reaches a certain size?

- Is there a maximum size when replicating segments from master to slave?

When replicating, is there a segment size limit beyond which the slave won't be able to download the content and index it? What is the threshold at which a slave can no longer replicate when there is a lot of query traffic and a lot of new documents to replicate?

To be more factual, here is the context that led me to these questions: we want to index a fair number of documents, but once the count exceeds a dozen million or so, the slaves can't handle it and start failing to replicate with a SnapPull error. The documents are composed of a few text fields (name, type, description, plus about 10 other fields of, say, 20 characters maximum each).
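For reference, the fields look roughly like this in schema.xml; only name, type and description are actual field names, and the remaining field names and the types shown are placeholders:

    <!-- Sketch of the document fields in schema.xml (illustrative, not the real schema) -->
    <field name="name"        type="text"   indexed="true" stored="true"/>
    <field name="type"        type="string" indexed="true" stored="true"/>
    <field name="description" type="text"   indexed="true" stored="true"/>
    <!-- ... plus roughly 10 more short fields of about 20 characters each, e.g.: -->
    <field name="extra_field_1" type="string" indexed="true" stored="true"/>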

We have one master and two slaves that replicate data from the master.
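Replication is done over HTTP with the standard ReplicationHandler (the SnapPull error below comes from it). Here is a minimal sketch of how it is wired; the host names and intervals are illustrative, not our actual values:

    <!-- solrconfig.xml on the master (illustrative values) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- solrconfig.xml on each of the two slaves (illustrative values) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <str name="pollInterval">00:00:20</str>
      </lst>
    </requestHandler>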

This is my first time working with Solr (I usually work on web apps using Spring, Hibernate, and so on, but without Solr), so I'm not sure how to tackle this issue.

Our idea for the moment is to add multiple cores to the master and have a slave replicate from each of these cores, as in the sketch below. Is that the right way to go?
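Concretely, the cores would be declared in solr.xml on the master along these lines (core names and instance directories are just placeholders):

    <!-- solr.xml on the master: one core per partition of the document set -->
    <solr persistent="true">
      <cores adminPath="/admin/cores">
        <core name="core0" instanceDir="core0" />
        <core name="core1" instanceDir="core1" />
        <!-- ... more cores as needed -->
      </cores>
    </solr>

Each slave would then define matching cores whose masterUrl points at the corresponding core on the master.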

If it is, how can we determine the number of cores needed? Right now we're just going to try it, see how it behaves, and adjust if necessary, but I was wondering whether there are any best practices or benchmarks on this specific topic.

Something along the lines of: for this number of documents of this average size, x cores or indexes are needed ...

Thanks for any help on how to deal with a huge number of documents of this average size!

Here is a copy of the error that is being thrown when a slave is trying to replicate:

ERROR [org.apache.solr.handler.ReplicationHandler] - <SnapPull failed >
org.apache.solr.common.SolrException: Index fetch failed :
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:280)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:135)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:65)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:142)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:166)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
Caused by: java.lang.RuntimeException: java.io.IOException: read past EOF
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:418)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:467)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
        ... 11 more
Caused by: java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:70)
        at org.apache.lucene.index.SegmentInfos$2.doBody(SegmentInfos.java:410)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:538)
        at org.apache.lucene.index.SegmentInfos.readCurrentVersion(SegmentInfos.java:402)
        at org.apache.lucene.index.DirectoryReader.isCurrent(DirectoryReader.java:791)
        at org.apache.lucene.index.DirectoryReader.doReopen(DirectoryReader.java:404)
        at org.apache.lucene.index.DirectoryReader.reopen(DirectoryReader.java:352)
        at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:413)
        at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:424)
        at org.apache.solr.search.SolrIndexReader.reopen(SolrIndexReader.java:35)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1049)
        ... 14 more

EDIT: After Mauricio's answer, the Solr libraries have been updated to 1.4.1, but this error was still raised. I then increased commitReserveDuration on the master, as sketched below.
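The setting lives on the master side of the replication handler; the value below is only an example, not the exact one we ended up using:

    <!-- solrconfig.xml on the master: keep the commit point reserved longer so
         slow slaves can finish downloading all index files (example value) -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="commitReserveDuration">00:01:00</str>
      </lst>
    </requestHandler>

With this change the "SnapPull failed" error seems to have disappeared, but another one started being raised instead, and I can't seem to find much about it on the web: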

ERROR [org.apache.solr.servlet.SolrDispatchFilter] - <ClientAbortException:  java.io.IOException
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:370)
        at org.apache.tomcat.util.buf.ByteChunk.append(ByteChunk.java:323)
        at org.apache.catalina.connector.OutputBuffer.writeBytes(OutputBuffer.java:396)
        at org.apache.catalina.connector.OutputBuffer.write(OutputBuffer.java:385)
        at org.apache.catalina.connector.CoyoteOutputStream.write(CoyoteOutputStream.java:89)
        at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:183)
        at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:89)
        at org.apache.solr.request.BinaryResponseWriter.write(BinaryResponseWriter.java:48)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:322)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:837)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:640)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1286)
        at java.lang.Thread.run(Thread.java:595)
Caused by: java.io.IOException
        at org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer(InternalAprOutputBuffer.java:703)
        at org.apache.coyote.http11.InternalAprOutputBuffer$SocketOutputBuffer.doWrite(InternalAprOutputBuffer.java:733)
        at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:124)
        at org.apache.coyote.http11.InternalAprOutputBuffer.doWrite(InternalAprOutputBuffer.java:539)
        at org.apache.coyote.Response.doWrite(Response.java:560)
        at org.apache.catalina.connector.OutputBuffer.realWriteBytes(OutputBuffer.java:365)
        ... 22 more
>
ERROR [org.apache.catalina.core.ContainerBase.[Catalina].[localhost].[/].[SolrServer]] - <Servlet.service() for servlet SolrServer threw exception>
java.lang.IllegalStateException
        at org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:405)
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:362)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:837)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:640)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1286)
        at java.lang.Thread.run(Thread.java:595)

I still wonder what the best practices are for handling a big index (more than 20 GB) containing a lot of documents with Solr. Am I missing some obvious links somewhere? Tutorials, documentation?

Asked Nov 13 '10 by Fanny H.


1 Answer

  • Cores are primarily a tool for having different schemas in a single Solr instance; they are also used as on-deck indexes. Sharding and replication are orthogonal issues.
  • You mention "a lot of traffic". That's a highly subjective measure. Instead, try to determine how many QPS (queries per second) you need from Solr. Also, does a single Solr instance answer your queries fast enough? Only then can you determine whether you need to scale out. A single Solr instance can handle a lot of traffic; maybe you don't even need to scale.
  • Make sure you run Solr on a server with plenty of memory (and make sure Java has access to it). Solr is quite memory-hungry; if you put it on a memory-constrained server, performance will suffer.
  • As the Solr wiki explains, use sharding if a single query takes too long to run, and replication if a single Solr instance can't handle the traffic (see the sketch after this list). "Too long" and "traffic" depend on your particular application. Measure them.
  • Solr has lots of settings that affect performance: cache auto-warming, stored fields, merge factor, etc. Check out SolrPerformanceFactors.
  • There are no hard rules here. Every application has different search needs. Simulate and measure for your particular scenario.
  • About the replication error, make sure you're running 1.4.1 since 1.4.0 had a bug with replication.
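To make the sharding/replication distinction concrete: each shard is an independent index, and a distributed query simply lists the shards to fan out to, with the node receiving the query merging the results. Host names and paths below are placeholders, not a recommendation for this particular setup:

    http://solr-a:8983/solr/select?q=name:foo&shards=solr-a:8983/solr,solr-b:8983/solr

Replication, by contrast, keeps full copies of one and the same index on several slaves, and a load balancer spreads the query traffic across them.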
Answered by Mauricio Scheffer