Yes, you can use SOLR as a database but there are some really serious caveats : SOLR's most common access pattern, which is over http doesnt respond particularly well to batch querying. Furthermore, SOLR does NOT stream data --- so you can't lazily iterate through millions of records at a time.
Solr is a search engine at heart, but it is much more than that. It is a NoSQL database with transactional support. It is a document database that offers SQL support and executes it in a distributed manner.
In general, indexing is an arrangement of documents or (other entities) systematically. Indexing enables users to locate information in a document. Indexing collects, parses, and stores documents. Indexing is done to increase the speed and performance of a search query while finding a required document.
After you post all your documents, call commit once manually or from SolrJ - it will take a while to commit, but this will be much faster overall. Also after you are done with your bulk import, reduce maxTime and maxDocs , so that any incremental posts you will do to Solr will get committed much sooner.
Yes, you can use SOLR as a database but there are some really serious caveats :
SOLR's most common access pattern, which is over http doesnt respond particularly well to batch querying. Furthermore, SOLR does NOT stream data --- so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when you design large scale data access patterns with SOLR.
Although SOLR performance scales horizontally (more machines, more cores, etc..) as well as vertically (more RAM, better machines, etc), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.
Developers who are used to using relational databases will often run into problems when they use the same DAO design patterns in a SOLR paradigm, because of the way SOLR uses filters in queries. There will be a learning curve for developing the right approach to building an application that uses SOLR for part of its large queries or statefull modifications.
The "enterprisy" tools that allow for advanced session management and statefull entities that many advanced web-frameworks (Ruby, Hibernate, ...) offer will have to be thrown completely out the window.
Relational databases are meant to deal with complex data and relationships - and they are thus accompanied by state of the art metrics and automated analysis tools. In SOLR, I've found myself writing such tools and manually stress-testing alot, which can be a time sink.
Joining : this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In SOLR, there aren't any robust methods for joining data across indices.
Resiliency : For high availability, SolrCloud uses a distributed file system underneath (i.e. HCFS). This model is quite different then that of a relational database, which usually does resiliency using slaves and masters, or RAID, and so on. So you have to be ready to provide the resiliency infrastructure SOLR requires if you want it to be cloud scalable and resistent.
That said - there are plenty of obvious advantages to SOLR for certain tasks : (see http://wiki.apache.org/solr/WhyUseSolr) -- loose queries are much easier to run and return meaningful results. Indexing is done as a matter of default, so most arbitrary queries run pretty effectively (unlike a RDBMS, where you often have to optimize and de-normalize after the fact).
Conclusion: Even though you CAN use SOLR as an RDBMS, you may find (as I have) that there is ultimately "no free lunch" - and the cost savings of super-cool lucene text-searches and high-performance, in-memory indexing, are often paid for by less flexibility and adoption of new data access workflows.
It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving this using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as default response format is largely irrelevant, you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.
This was probably done for performance reasons, if it doesn't cause any problems I would leave it alone. There is a big grey area of what should be in a traditional database vs a solr index. Ive seem people do similar things to this (usually key value pairs or json instead of xml) for UI presentation and only get the real object from the database if needed for updates/deletes. But all reads just go to Solr.
I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With