java - MongoDB + Solr performance

Tags:

java

mongodb

solr

I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text searches on some properties of those documents, so I guess Solr is the best way to do this.

What I want to know is: how should I configure/execute everything so that it performs well? Right now, here's what I do (and I know it's not optimal):

1- When inserting an object in MongoDB, I then add it to Solr

SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId); // same id as the MongoDB document
...
server.add(document);
server.commit(); // explicit commit on every single insert

2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then I update the Solr document with all of the object's properties (existing and new) and do something like

StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0); // (url, queue size, thread count)
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document); // re-adds the whole document, replacing the old one
update.commit();

3- When querying, I first query Solr and then, for each document in the returned SolrDocumentList, I (a sketch follows after this list):

  1. get the id of the document
  2. get the object from MongoDB having the same id to be able to retrieve the properties from there
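In code, this lookup looks roughly like the following (a sketch only; the query string, getCollection() and the field names are placeholders from my setup, not an actual API):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

SolrServer server = getServer();
QueryResponse response = server.query(new SolrQuery("text:something")); // hypothetical full-text query
SolrDocumentList results = response.getResults();

DBCollection collection = getCollection(); // placeholder for my MongoDB collection
List<DBObject> objects = new ArrayList<DBObject>();
for (SolrDocument doc : results) {
    Object id = doc.getFieldValue("id");
    // one findOne() round trip per result -- this is the part that feels inefficient
    objects.add(collection.findOne(new BasicDBObject("_id", id)));
}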

4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java

So, does anybody have suggestions on how to do each of these scenarios more efficiently? In particular, is there a process that won't take an hour to rebuild the index when there are lots of documents in Solr and documents are added one at a time? My requirement here is that users may add one document at a time, many times, and I'd like them to be able to retrieve it right after.

asked Aug 25 '11 by Guillaume


2 Answers

Your approach is actually good. Some popular frameworks like Compass perform what you describe at a lower level in order to automatically mirror to the index the changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).

In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure both Solr and Mongo stay in sync. A full re-index probably takes less time than you might think: depending on the number of documents, the number of fields, the number of tokens per field and the performance of the analyzers, I often build an index of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers. Just ensure your RAM buffer is not too small, and do not commit/optimize until all documents have been added.
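A minimal sketch of such a bulk re-index (loadAllFromMongo() is a placeholder for iterating over your MongoDB collection, not an actual API):

SolrServer server = getServer();
server.deleteByQuery("*:*"); // start from an empty index

for (DBObject object : loadAllFromMongo()) { // placeholder iterator over the Mongo collection
    SolrInputDocument document = new SolrInputDocument();
    document.addField("id", object.get("_id"));
    // ... add the other searchable fields ...
    server.add(document); // no commit inside the loop
}

server.commit();   // single commit once everything has been added
server.optimize(); // optional, and expensive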

Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).
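For example, in solrconfig.xml (the enclosing element varies between Solr versions, so treat this as a sketch):

<indexDefaults>
  <!-- high mergeFactor = faster writes, low mergeFactor = faster reads -->
  <mergeFactor>10</mergeFactor>
</indexDefaults>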

You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). The warm-up time, however, will increase, so you should ensure that

  • there are typical (especially for sorts, in order to load the field caches) but not too complex queries in the firstSearcher and newSearcher parameters of your solrconfig.xml config file (see the sketch after this list),
  • useColdSearcher is set to
    • false in order to have good search performance, or
    • true if you want changes performed to the index to be taken into account faster, at the price of a slower search.
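As a rough sketch (the warm-up queries below are placeholders; adapt them to your own fields and sort orders):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- a typical query including a sort, so that the field caches get loaded -->
    <lst><str name="q">text:example</str><str name="sort">creationDate desc</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">text:example</str><str name="sort">creationDate desc</str></lst>
  </arr>
</listener>
<useColdSearcher>false</useColdSearcher>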

Moreover, if it is acceptable for you that the data becomes searchable only X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of the UpdateHandler. This way Solr will have to commit less often.
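With SolrJ, commitWithin can be set on the update request itself (a sketch; the 10-second window is an arbitrary example):

import org.apache.solr.client.solrj.request.UpdateRequest;

UpdateRequest request = new UpdateRequest();
request.add(document);
request.setCommitWithin(10000); // ask Solr to commit within 10 seconds
request.process(server);        // no explicit commit() needed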

For more information about Solr performance factors, see http://wiki.apache.org/solr/SolrPerformanceFactors

To delete documents, you can either delete by document ID (as defined in schema.xml) or by query: http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
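With SolrJ, that is (the delete-by-query string is only an example):

server.deleteById(documentId);             // delete by the uniqueKey defined in schema.xml
server.deleteByQuery("category:obsolete"); // or delete everything matching a query
server.commit();                           // or rely on commitWithin as above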

answered Nov 17 '22 by jpountz


  1. You can also wait for more documents and index them only every X minutes (of course, this highly depends on your application & requirements); see the batching sketch after point 2.

  2. If your documents are small and you don't need all the data stored in MongoDB, you can put only the fields you need in the Solr document by storing them but not indexing them:

<field name="nameOfYourField" type="stringOrAnyTypeYouUse" indexed="false" stored="true"/>
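For point 1, a minimal sketch of such deferred indexing using a scheduled flush (the queue, the interval and getServer() are assumptions, not part of any framework):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchingIndexer {
    private final BlockingQueue<SolrInputDocument> queue = new LinkedBlockingQueue<SolrInputDocument>();
    private final SolrServer server;

    public BatchingIndexer(SolrServer server, long intervalMinutes) {
        this.server = server;
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
                new Runnable() {
                    public void run() { flush(); }
                }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    public void enqueue(SolrInputDocument document) {
        queue.add(document); // cheap, no network call here
    }

    private void flush() {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        queue.drainTo(batch);
        if (batch.isEmpty()) return;
        try {
            server.add(batch); // one request for the whole batch
            server.commit();   // one commit per interval instead of one per document
        } catch (Exception e) {
            // in a real application, re-queue or log the failed batch
        }
    }
}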

answered Nov 17 '22 by Aurélien B