Are RDBMS that bad as described in Hadoop: The definitive guide?

I'm reading Hadoop: The Definitive Guide by Tom White. In chapter 13.6, "HBase vs RDBMS", he says that if you have a lot of data, even simple queries like getting the 10 most recent items are extremely expensive, and they had to rewrite them in Python and PL/SQL.

He gives the following query as an example:

SELECT id, stamp, type FROM streams 
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;

And says: "an RDBMS query planner treats this query as follows:

MERGE (
  SELECT id, stamp, type FROM streams
    WHERE type = 'type1' ORDER BY stamp DESC,
  ...,
  SELECT id, stamp, type FROM streams
    WHERE type = 'typeK' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;

The problem here is that we are after only the top 10 IDs, but the query planner actually materializes an entire merge and then limits at the end. .... We actually went so far as to write a custom PL/Python script that performed a heapsort. ... In nearly all cases, this outperformed the native SQL implementation and the query planner’s strategy...
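The book doesn't reproduce their PL/Python script, but the idea of a bounded heap-merge over K per-type streams can be sketched in plain Python. The stream contents and tuple layout below are made up for illustration; in a real PL/Python function, each stream would be a cursor walking a per-type index in stamp-descending order:

```python
import heapq
from itertools import islice

def top_n(streams, n=10):
    """Merge K streams already sorted by stamp DESC and stop after n rows.

    Each stream yields (stamp, id, type) tuples in descending stamp order.
    heapq.merge with reverse=True preserves the overall descending order,
    and because generators are lazy, only about n + K rows are ever pulled
    instead of materializing the full merge.
    """
    return list(islice(heapq.merge(*streams, reverse=True), n))

# Hypothetical per-type streams, newest first
s1 = [(95, 7, 'type1'), (60, 3, 'type1')]
s2 = [(99, 9, 'type2'), (80, 5, 'type2')]
print(top_n([s1, s2], 3))
# [(99, 9, 'type2'), (95, 7, 'type1'), (80, 5, 'type2')]
```

The key point is laziness: the heap holds at most K items at a time, so the cost is O(n log K) heap operations rather than a full sort of the merged result.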

Expected performance and experimental results

I couldn't imagine a data set that would cause such problems that you'd have to write PL/Python to get such a simple query right. So I played with this problem for a while and came up with the following observations:

The performance of such a query is bounded by O(K log N), because it can be translated into something like this:

SELECT * FROM (
  SELECT id, stamp, type FROM streams
    WHERE type = 'type1' ORDER BY stamp DESC LIMIT 10
  UNION
  ...
  SELECT id, stamp, type FROM streams
    WHERE type = 'typeK' ORDER BY stamp DESC LIMIT 10
) t ORDER BY stamp DESC LIMIT 10;

(Note the 'LIMIT 10' in each subquery. BTW, I know that you can't ORDER BY and LIMIT the members of a UNION directly, but I've stripped out the wrapping SELECTs for the sake of readability.)

Each subquery should run as fast as finding the right position in an index, O(log N), and returning 10 items. If we repeat that K times, we get O(K log N).
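As a sanity check of that claim, here's a toy model of one subquery against a sorted (type, stamp) index using binary search. A sorted Python list stands in for a real B-tree index, which is of course a simplification:

```python
import bisect

def last_n_for_type(index, typ, n=10):
    """One subquery's work: an index on (type, stamp) is modeled as a sorted
    list of tuples, so a binary search (O(log N)) locates the slice for `typ`
    and the last n entries of that slice are the n most recent stamps."""
    lo = bisect.bisect_left(index, (typ,))            # first key with this type
    hi = bisect.bisect_right(index, (typ, float('inf')))  # one past the last
    return index[max(lo, hi - n):hi][::-1]            # newest first

index = sorted([('a', 1), ('a', 5), ('b', 2), ('a', 9), ('b', 8)])
print(last_n_for_type(index, 'a', 2))
# [('a', 9), ('a', 5)]
```

Two bisections plus reading n entries is O(log N + n) per type, which gives the O(K log N) total for K types with constant n.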

And even if the query planner is so bad that it cannot optimize the first query, we can always translate it into the query with UNIONs and get the desired performance without writing anything in PL/Python.

To double-check my calculations, I ran the queries above on PostgreSQL with a table of 9,000,000 test records. The results confirmed my expectations: both queries were quite fast, 100 ms for the first query and 300 ms for the second (the one with UNIONs).

So if the query runs in 100 ms for 9,000,000 records (log2 N ≈ 23), then for 9,000,000,000 records (log2 N ≈ 33) it should run in about 140 ms.
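The arithmetic behind that extrapolation, assuming the runtime scales linearly with log2 N:

```python
import math

log_9m = math.log2(9_000_000)      # ~23.1
log_9b = math.log2(9_000_000_000)  # ~33.1
t_9m_ms = 100                      # measured time for 9M records
t_9b_ms = t_9m_ms * log_9b / log_9m
print(round(t_9b_ms))  # 143
```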

Questions

  • Do you see any flaws in the above reasoning?
  • Can you imagine a data set where you would need to rewrite such a query in PL/Python, as described above?
  • Do you see any situation in which such a query wouldn't run in O(K log N)?
Piotr Czapla asked Nov 26 '10




2 Answers

Their assertion that an RDBMS query planner handles that query in the way described is incorrect, at least for PostgreSQL 9.0, and I imagine for other platforms too. I did a quick test with a similar query:

explain select * from client_attribute where client_attribute_type_code in ('UAG', 'RFR', 'IPA', 'FVD') order by client_attribute_id desc limit 10;

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.93 rows=10 width=85)
   ->  Index Scan Backward using client_attribute_pkey on client_attribute  (cost=0.00..15516.47 rows=167234 width=85)
         Filter: (client_attribute_type_code = ANY ('{UAG,RFR,IPA,FVD}'::bpchar[]))
(3 rows)

Here client_attribute_id is indexed, so it does exactly what's desired: walks back through the index, applies the filter, and stops when the output hits the limit.

If the ordering column is not indexed, a table scan and sort is required, but only one table scan:

explain analyze select * from client_attribute where client_attribute_type_code in ('UAG', 'RFR', 'IPA', 'FVD') order by updated desc limit 10;

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13647.00..13647.03 rows=10 width=85) (actual time=180.961..180.964 rows=10 loops=1)
   ->  Sort  (cost=13647.00..14065.09 rows=167234 width=85) (actual time=180.960..180.961 rows=10 loops=1)
         Sort Key: updated
         Sort Method:  top-N heapsort  Memory: 26kB
         ->  Seq Scan on client_attribute  (cost=0.00..10033.14 rows=167234 width=85) (actual time=0.010..106.791 rows=208325 loops=1)
               Filter: (client_attribute_type_code = ANY ('{UAG,RFR,IPA,FVD}'::bpchar[]))

This uses a heapsort to maintain the top 10 results through the course of the sequential scan, which sounds exactly like the solution they wrote themselves.
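For reference, Python's heapq.nlargest performs the same bounded top-N heapsort, and is presumably close in spirit to what their custom script did (the rows and key below are made up for illustration):

```python
import heapq

def limit_sort(rows, n=10, key=lambda r: r['updated']):
    """What Postgres's 'top-N heapsort' does: a single pass over the filtered
    rows, keeping only the n largest in a bounded heap, instead of sorting
    the entire result set and discarding everything past the limit."""
    return heapq.nlargest(n, rows, key=key)

rows = [{'id': i, 'updated': u} for i, u in enumerate([5, 42, 17, 99, 3, 77])]
print([r['updated'] for r in limit_sort(rows, 3)])  # [99, 77, 42]
```

The memory footprint is bounded by n (note the "Memory: 26kB" in the plan above), regardless of how many rows the scan produces.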

araqnid answered Oct 19 '22


I don't think Tom White is saying that relational databases are "bad"; they just aren't optimal for non-relational, non-set-based data.

It's been well known for a long time that deep object graphs don't lend themselves well to relational databases. They're typically found in problems like CAD representations of geometric data, where assemblies are made up of assemblies of assemblies of parts. The reference chains are very long, indeed.

Object and graph databases have been the solutions to that kind of problem since I became aware of them in the early '90s.

Relational databases are terrific for relational, set-based data. But not all data falls into that category. That's why NoSQL is gaining mind share.

I think that's what the example you cite is saying.

duffymo answered Oct 19 '22