Why is Solr so much faster than Postgres?

Q: Why Postgres is faster?

Speed however is a benchmark that will be decided based on how the Database is being utilized. PostgreSQL is faster when dealing with massive datasets, complicated queries, and read-write operations. On the other hand, MySQL is known to be faster for read-only commands.

Q: Is PostgreSQL faster?

It's been a big year for open source relational database PostgreSQL. Stack Overflow developers named it the Most Wanted Database in 2021 and DB-Engines named it the Database Management System of the Year for the third time.

Tags:

performance

postgresql

solr

lucene

rdbms

I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."

I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres though it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).

As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.

What differences in architecture would explain this?

457

asked Apr 07 '12 08:04

cberner

1 Answers

First, Solr doesn't use B-trees. A Lucene (the underlying library used by Solr) index is made of a read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this term dictionary is made using a binary search, so the cost of a single-term lookup is O(log(t)) where t is the number of terms. On the contrary, using the index of a standard RDBMS costs O(log(d)) where d is the number of documents. When many documents share the same value for some field, this can be a big win.

Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)

But Solr can only do this because it doesn't have all the constraints that a RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).

123

answered Oct 12 '22 22:10

jpountz

Related questions
                            
                                Best way to execute js only on specific page
                            
                                Performance of static methods vs. functions
                            
                                Why is looping through an Array so much faster than JavaScript's native `indexOf`?
                            
                                How to convert a huge list-of-vector to a matrix more efficiently?
                            
                                Fastest way to remove non-numeric characters from a VARCHAR in SQL Server
                            
                                Performance differences between visibility:hidden and display:none
                            
                                Should LINQ be avoided because it's slow? [closed]
                            
                                Performance of "direct" virtual call vs. interface call in C#
                            
                                Optimal way to compute pairwise mutual information using numpy
                            
                                Large public datasets? [closed]
                            
                                Get all files and directories in specific path fast
                            
                                Python threads all executing on a single core
                            
                                Do you quote HTML5 attributes? [closed]
                            
                                What is the fastest way to parse large XML docs in Python?
                            
                                How can I distinguish between high- and low-performance cores/threads in C++?
                            
                                JavaScript loop performance - Why is to decrement the iterator toward 0 faster than incrementing
                            
                                MySQL: Many tables or many databases?
                            
                                Are C++ enums slower to use than integers?
                            
                                boost serialization vs google protocol buffers? [closed]
                            
                                In what way does denormalization improve database performance?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With