Are RDBMS that bad as described in Hadoop: The definitive guide?

I'm reading Hadoop: The Definitive Guide by Tom White. In chapter 13.6, "HBase vs RDBMS", he says that if you have a lot of data, even simple queries like getting the 10 most recent items are extremely expensive, and they had to rewrite them in Python and PL/SQL.

He gives the following query as an example:

SELECT id, stamp, type FROM streams 
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;

And says: "an RDBMS query planner treats this query as follows:

MERGE (
  SELECT id, stamp, type FROM streams
    WHERE type = 'type1' ORDER BY stamp DESC,
  ...,
  SELECT id, stamp, type FROM streams
    WHERE type = 'typeK' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;

The problem here is that we are after only the top 10 IDs, but the query planner actually materializes an entire merge and then limits at the end. .... We actually went so far as to write a custom PL/Python script that performed a heapsort. ... In nearly all cases, this outperformed the native SQL implementation and the query planner’s strategy...
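The book doesn't reproduce their PL/Python script, but the idea of a bounded heap-merge over K per-type streams can be sketched in plain Python. The stream contents and tuple layout below are made up for illustration; in a real PL/Python function, each stream would be a cursor walking a per-type index in stamp-descending order:

```python
import heapq
from itertools import islice

def top_n(streams, n=10):
    """Merge K streams already sorted by stamp DESC and stop after n rows.

    Each stream yields (stamp, id, type) tuples in descending stamp order.
    heapq.merge with reverse=True preserves the overall descending order,
    and because generators are lazy, only about n + K rows are ever pulled
    instead of materializing the full merge.
    """
    return list(islice(heapq.merge(*streams, reverse=True), n))

# Hypothetical per-type streams, newest first
s1 = [(95, 7, 'type1'), (60, 3, 'type1')]
s2 = [(99, 9, 'type2'), (80, 5, 'type2')]
print(top_n([s1, s2], 3))
# [(99, 9, 'type2'), (95, 7, 'type1'), (80, 5, 'type2')]
```

The key point is laziness: the heap holds at most K items at a time, so the cost is O(n log K) heap operations rather than a full sort of the merged result.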

Expected performance and experimental results

I couldn't imagine a data set that would cause such problems that you'd have to write PL/Python to get such a simple query right. So I played with this problem for a while and came up with the following observations:

The performance of such a query is bounded by O(K log N), because it can be translated into something like this:

SELECT * FROM (
  SELECT id, stamp, type FROM streams
    WHERE type = 'type1' ORDER BY stamp DESC LIMIT 10
  UNION
  ...
  SELECT id, stamp, type FROM streams
    WHERE type = 'typeK' ORDER BY stamp DESC LIMIT 10
) t ORDER BY stamp DESC LIMIT 10;

(Note the 'LIMIT 10' in each subquery. BTW, I know that you can't ORDER BY and LIMIT the members of a UNION directly, but I've stripped out the wrapping SELECTs for the sake of readability.)

Each subquery should run as fast as finding the right position in an index, O(log N), and returning 10 items. If we repeat that K times, we get O(K log N).
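As a sanity check of that claim, here's a toy model of one subquery against a sorted (type, stamp) index using binary search. A sorted Python list stands in for a real B-tree index, which is of course a simplification:

```python
import bisect

def last_n_for_type(index, typ, n=10):
    """One subquery's work: an index on (type, stamp) is modeled as a sorted
    list of tuples, so a binary search (O(log N)) locates the slice for `typ`
    and the last n entries of that slice are the n most recent stamps."""
    lo = bisect.bisect_left(index, (typ,))            # first key with this type
    hi = bisect.bisect_right(index, (typ, float('inf')))  # one past the last
    return index[max(lo, hi - n):hi][::-1]            # newest first

index = sorted([('a', 1), ('a', 5), ('b', 2), ('a', 9), ('b', 8)])
print(last_n_for_type(index, 'a', 2))
# [('a', 9), ('a', 5)]
```

Two bisections plus reading n entries is O(log N + n) per type, which gives the O(K log N) total for K types with constant n.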

And even if the query planner is so bad that it cannot optimize the first query, we can always translate it into the query with UNIONs and get the desired performance without writing anything in PL/Python.

To double-check my calculations, I ran the queries above on PostgreSQL with a table of 9,000,000 test records. The results confirmed my expectations: both queries were quite fast, 100 ms for the first query and 300 ms for the second (the one with UNIONs).

So if the query runs in 100 ms for 9,000,000 records (log2 N ≈ 23), then for 9,000,000,000 records (log2 N ≈ 33) it should run in about 140 ms.
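The arithmetic behind that extrapolation, assuming the runtime scales linearly with log2 N:

```python
import math

log_9m = math.log2(9_000_000)      # ~23.1
log_9b = math.log2(9_000_000_000)  # ~33.1
t_9m_ms = 100                      # measured time for 9M records
t_9b_ms = t_9m_ms * log_9b / log_9m
print(round(t_9b_ms))  # 143
```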

Questions

  • Do you see any flaws in the above reasoning?
  • Can you imagine a data set where you would need to rewrite such a query in PL/Python, as described above?
  • Do you see any situation in which such a query wouldn't run in O(K log N)?
Piotr Czapla asked Nov 26 '10




2 Answers

Their assertion that an RDBMS query planner handles that query in the way described is incorrect, at least for PostgreSQL 9.0, and I imagine for other platforms too. I did a quick test with a similar query:

explain select * from client_attribute where client_attribute_type_code in ('UAG', 'RFR', 'IPA', 'FVD') order by client_attribute_id desc limit 10;

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.93 rows=10 width=85)
   ->  Index Scan Backward using client_attribute_pkey on client_attribute  (cost=0.00..15516.47 rows=167234 width=85)
         Filter: (client_attribute_type_code = ANY ('{UAG,RFR,IPA,FVD}'::bpchar[]))
(3 rows)

Here client_attribute_id is indexed, so it does exactly what's desired: walks back through the index, applies the filter, and stops when the output hits the limit.

If the ordering column is not indexed, a table scan and sort is required, but only one table scan:

explain analyze select * from client_attribute where client_attribute_type_code in ('UAG', 'RFR', 'IPA', 'FVD') order by updated desc limit 10;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13647.00..13647.03 rows=10 width=85) (actual time=180.961..180.964 rows=10 loops=1)
   ->  Sort  (cost=13647.00..14065.09 rows=167234 width=85) (actual time=180.960..180.961 rows=10 loops=1)
         Sort Key: updated
         Sort Method:  top-N heapsort  Memory: 26kB
         ->  Seq Scan on client_attribute  (cost=0.00..10033.14 rows=167234 width=85) (actual time=0.010..106.791 rows=208325 loops=1)
               Filter: (client_attribute_type_code = ANY ('{UAG,RFR,IPA,FVD}'::bpchar[]))

This uses a heapsort to maintain the top 10 results through the course of the sequential scan, which sounds exactly like the solution they wrote themselves.
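For reference, Python's heapq.nlargest performs the same bounded top-N heapsort, and is presumably close in spirit to what their custom script did (the rows and key below are made up for illustration):

```python
import heapq

def limit_sort(rows, n=10, key=lambda r: r['updated']):
    """What Postgres's 'top-N heapsort' does: a single pass over the filtered
    rows, keeping only the n largest in a bounded heap, instead of sorting
    the entire result set and discarding everything past the limit."""
    return heapq.nlargest(n, rows, key=key)

rows = [{'id': i, 'updated': u} for i, u in enumerate([5, 42, 17, 99, 3, 77])]
print([r['updated'] for r in limit_sort(rows, 3)])  # [99, 77, 42]
```

The memory footprint is bounded by n (note the "Memory: 26kB" in the plan above), regardless of how many rows the scan produces.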

araqnid answered Oct 19 '22


I don't think Tom White is saying that relational databases are "bad"; they just aren't optimal for non-relational, non-set-based data.

It's been well known for a long time that deep object graphs don't lend themselves well to relational databases. They're typically found in problems like CAD representations of geometric data, where assemblies are made up of assemblies of assemblies of parts. The reference chains are very long, indeed.

Object and graph databases have been the solutions to that kind of problem since I became aware of them in the early '90s.

Relational databases are terrific for relational, set-based data. But not all data falls into that category. That's why NoSQL is gaining mind share.

I think that's what the example you cite is saying.

duffymo answered Oct 19 '22