Relational v Hierarchical data models

Tags:

When the relational model was put forward by F.E. Codd, the established databases of the time used the hierarchical model. My understanding is that the relational model was felt to be a significant improvement on the hierarchical approach.

My intuition is that this "makes sense" for a few reasons.

The relational model seems to be "query agnostic" in that instead of the data's shape reflecting the way that you're likely to query it, it is instead structured so that any question can be asked relatively easily.
The relational model also makes mutability simple. You assert or retract facts by adding rows to a table (adding tuples to a set), or removing them. In contrast, in a hierarchical setting you need to add or remove from some other object, which introduces secondary questions such as does the parent object need to be created if it doesn't exist, or removed if it is empty.
The relational model readily models relationships that don't easily fit in to a parent child approach, such as the relation between three entities.
The relational model seems like it is better suited to schema growth as new kinds of fact can be added using new tables. Doing so with some care need not disrupt existing tables (and facts), or services that depend on them.

However, while it feels like the relational data model has advantages, I'd like to have some insight on why it was definitely believed to be significantly superior at the time, and presumably, still is.

I'd really appreciate the arguments in some kind of distilled form, or ideally, one or more papers or other documents, or a canonical reference that go through the reasoning behind this.

For clarity, I'm not asking about actual implementations of either approach, or their relative resource use in terms of storage or compute, unless this is of great salience to the answer.

Thanks.

524

asked Sep 20 '17 12:09

Benjohn

2 Answers

It's not quite right to say "established databases of the time used the hierarchical model".

Firstly (to nit-pick), it's database management systems that do/don't use some physical structure. "databases" -- that is database designs might use all sorts of abstractions. Entity-Relational modelling was and still is popular as a design tool irrespective of the eventual physical platform.

Secondly, at the time whilst hierarchical models were usual for 'big iron' large databases, indexed-sequential was far more common on what used to be called 'mini-computers' (like DEC PDP-8/-11; IBM System/34,/36; ICL 1900/ME29; Honeywell DPS4/DPS7).

We might say indexed-sequential organisation on disk grew out of punched-card or magtape systems using batch update-by-copy. That's where the "sequential" comes from.

You say you don't want to ask about the actual implementation; but the answer is all about actual implementation. Reading disk sequentially is more efficent than random access (which needs the read-head to jump about). That's why in contrast to disk, memory is called "Random Access Memory". (This was long before RAM became so cheap we could keep a whole database in memory.)

Similarly the hierarchical model was organised to provide rapid access for commonly-needed query paths. The hierarchy put closely-linked nodes on the same patch of physical disk. So it was easy to navigate from Customer to that Customer's Orders to that Order's Item Lines.

The downside was it was difficult to navigate 'across' the hierarchy -- for example to find all the Order Lines for Item P5432, irrespective of which Customer/Order. (Furthermore if you then want to retrieve the Customers who are ordering P5432, you need to work 'backwards' up the hierarchy. If it's all on the same patch of disk, hopefully you don't need to look too far/perhaps it's in the same disk bucket loaded to RAM.)

Similarly the indexed-sequential organisation favoured one particular index -- the primary key. If you wanted to search by Customer name rather than number, that required a 'secondary index' with all sorts of ugly organisation to keep the index buckets somewhere near the data. And the notorious 'bucket overflow' which could stop a machine dead as you amended one teensy weensy spelling mistake in a name, so shifting it to an entirely different alphabetic position.

(By the way, NoSQL databases, being key-value stores with only one key, seem destined to fall into all those traps to do with secondary indexing. They need a second key-value store to provide an alternative index, with all sorts of fun keeping them in synch. Back to the future!)

The biggest problem Codd had in implementing the Relational Model was to persuade IBM top brass that the model could efficiently support querying through multiple 'access paths'. You'll see a lot of his early papers talking about abstracting the 'navigation' away from the query-writer/programmer. In fact the original System/R design had lots of compromises because

a) the IBM Engineers just didn't understand the mathematical abstractions Codd was talking about;

b) they were scared shitless it would perform like a dog and they'd lose their jobs.

[ahem! personal opinion: but the group got together long after, and there's some reminiscences somewhere around on the web.] Those compromises have persisted 'til today in SQL; which is frankly a pile of crud and should have been killed off as merely an interesting proof of concept.

How did Codd's model succeed (or rather the SQL model, not Codd's)?

Disk technology improved -- particularly seek times
somebody figured out hash-indexing and b-trees, and keeping all the indexes for a table in separate memory to the actual data; rather than trying to hold it like a magtape serial store.
Larry Ellison sniffed out something was afoot, and stole members of the IBM engineering team to build the same thing at Oracle. Also Michael Stonebreaker formed Ingres.

The race was on! There was no time to stop and get everything right. Implement what you've got (i.e. the SQL proof of concept) and rush it to market, ready or not. (Sound like a familiar story?)

Your points about the superiority of the Relational Model are all well-made. They essentially follow from normalisation techniques. I would say, though, that they weren't well understood in the late '70's/'80's. Schema designs looked a lot like the hierarchical or indexed-sequential data models, just transposed to 'flat' tables. In particular, there was a tendency to design 'wide' tables to bring together on one patch of disk everything we know about some Customer, rather than vertically partitioning. (For fear of performance hits from joining together the partitions.) That meant a lot of not-applicable or 'not known' fields -- which is the abomination of SQL's null.

So your "improvements" are as yet only partially attained. One day perhaps we'll see a DBMS engineered to the Relational Model. For now we'll have to put up with SQL.

193

answered Dec 24 '22 04:12

AntC

AntC gives a super answer with lots of historical background. I wanted to collect an answer from links made in comments to ensure they're not missed.

An absolutely terrific read is the write up of Codd's 1981 Turing Award Lecture. Thanks to reaanb for suggesting it.

In the paper, Codd sets out the case against contemporary databases thus:

These systems burdened application programmers with numerous concepts that were irrelevant to their data retrieval and manipulation tasks, forcing them to think and code at a needlessly low level of structural detail (the "owner-member set" of CODASYL DBTG is an outstanding example.

No commands were provided for processing multiple records at a time--in other words, DBMS did not support set processing and, as a result, programmers were forced to think and code in terms of iterative loops that were often unnecessary (here we use the word "set" in its traditional mathematical sense, not the linked structure sense of CODASYL DBTG};

The needs of end users for direct interaction with databases, particularly interaction of an unanticipated nature, were inadequately recognised. A query capability was assumed to be something one could add on to a DBMS at some later time.

With the problem described, shortly after Codd lays out his requirements for a replacement – the objectives that he wants it to fulfil.

The most important motivation for the research work that resulted in the relational model was the objective of providing a sharp and clear boundary between the logical and physical aspects of database management including database design, data retrieval, and data manipulation. We call this the data independence objective.

A second objective was to make the model structurally simple, so that all kinds of users and programmers could have a common understanding of the data, and could therefore communicate with one another about the database. We call this the communicability objective.

A third objective was to introduce high-level language concepts but not specific syntaxI to enable users to express operations upon large chunks of information at a time. This entailed providing a foundation for set oriented processing i.e., the ability to express in a single statement the processing of multiple sets of records at a time. We call this the set-processing objective.

There were other objectives, such as providing a sound theoretical foundation for database organization and management, but these objectives are less relevant to our present productivity theme.

And then finally he introduces the relational model and shows how it meets the desired objectives and solves the objections to the contemporary systems.

In my view this paper remains extremely relevant.

It gives a strong sense of why Codd was proposing the Relational approach, and it actually throws down the gauntlet for current systems to match his expectations and hopes. I think they fall short, particularly in the way we tend to deploy in heterogeneous / polyglot client server systems. Can we claim that they overcome Codd's objections?

The paper is an approachable and easy read for anyone with only passing familiarity with the field.

A final rather extraordinary extract from the paper stands as a warning to orthodoxy (emphasis mine):

The relational algebra, which includes these and other operators, is intended as a yardstick of power. It is not intended to be standard language, to which all relational systems should adhere. The set-processing objective of the relational model is intended to be met by means of a data sublanguages having at least the power of the relational algebra without making use of iteration or recursion statements.

So Codd is not claiming that the relational model is the best or only approach. He says it has necessary properties to achieve his objectives, but need not be the final word in database design.

It's also worth looking at Codd's Original Paper on relational databases. Thanks to philipxy for this. It is also a readable paper and has plenty of justification and background. It does have more formal content, but there's nothing you won't be able to get a handle on if you've some understanding of relational databases.

answered Dec 24 '22 05:12

Benjohn

Related questions
                            
                                jOOQ H2 Case sensitivity issues without codegen
                            
                                What is relational data? [closed]
                            
                                Enforce Multiple Relationships Between Two Tables In Access
                            
                                Database design structure for storing courses chapters topics subtopics
                            
                                What are the pros and cons for using a database to store configuration information compared to a File System
                            
                                Storing inherited objects in a database
                            
                                How to map standard Java types to SQL types?
                            
                                Simulating relational data (or a database) in Python?
                            
                                Multiple toMany relationships to a single table
                            
                                Amazon AWS RDS: how to make the database Publicly Accessible to the internet
                            
                                SQL query to count most "popular" value?
                            
                                Why is multi-value field a bad idea in relational databases
                            
                                Can SQLAlchemy add new records and automate the relationship between tables without me manually checking for PK uniqueness?
                            
                                Alternative to using subquery inside CHECK constraint?
                            
                                Relational data modeling for sub types
                            
                                ERM: Cardinality in ternary relationships
                            
                                What is the difference between SQL Server Compact Edition and SQL Server Express Edition?
                            
                                Is RavenDb performance better in comparison to relational databases? [closed]
                            
                                Do the "columns" in a table in a RMDB have order?
                            
                                How to use 'hasManyThrough' with more than 3 tables in Laravel?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Relational v Hierarchical data models

Tags:

relational-database

hierarchical-data

relational-algebra

Benjohn

People also ask

2 Answers

AntC

Benjohn

Recent Activity

Donate For Us