Life without JOINs... understanding, and common practices

Lots of "BAW"s (big ass-websites) use data storage and retrieval techniques that rely on huge tables with indexes, and query languages that won't/can't use JOINs (BigTable, HQL, etc.), in order to deal with scalability and database sharding. How does that work when you have lots and lots of data that is very related?

I can only speculate that much of this joining has to be done on the application side of things, but doesn't that start to get expensive? What if you have to make several queries to several different tables to get information to compile? Isn't hitting the database that many times starting to get more expensive than just using joins in the first place? I guess it depends on how much data you've got?

And for commonly available ORMs, how do they tend to deal with the inability to use joins? Is there support for this in ORMs that are in heavy usage today? Or do most projects that have to approach this level of data tend to roll their own anyways?

This is not applicable to any current project of mine, but it's been in my head for several months now, and I can only speculate as to what the "best practices" are. I've never had a need to address this in any of my projects because they've never reached a scale where it is required. Hopefully this question helps other people as well.

As someone said below, ORMs "don't work" without joins. Are there other data access layers that are already available to developers working with data on this level?

EDIT: For some clarification, Vinko Vrsalovic said:

"I believe snicker wants to talk about NO-SQL, where transactional data is denormalized and used in Hadoop or BigTable or Cassandra schemes."

This is indeed what I'm talking about.

Bonus points for those who catch the xkcd reference.

snicker asked Oct 07 '09 15:10



2 Answers

The way I look at it, a relational database is a general purpose tool to hedge your bets. Modern computers are fast enough, and RDBMSs are well-optimized enough, that you can grow to quite a respectable size on a single box. By choosing an RDBMS you are giving yourself very flexible access to your data, and the ability to have powerful correctness constraints that make it much easier to code against the data. However, the RDBMS is not going to represent a good optimization for any particular problem; it just gives you the flexibility to change problems easily.

If you start growing rapidly and realize you are going to have to scale beyond the size of a single DB server, you suddenly have much harder choices to make. You will need to start identifying bottlenecks and removing them. The RDBMS is going to be one nasty snarled knot of codependency that you'll have to tease apart. The more interconnected your data, the more work you'll have to do, but maybe you won't have to completely disentangle the whole thing. If you're read-heavy, maybe you can get by with simple replication. If you're saturating your market and growth is leveling off, maybe you can partially denormalize and shard to a fixed number of DB servers. Maybe you just have a handful of problem tables that can be moved to a more scalable data store. Maybe your usage profile is very cache friendly and you can just migrate the load to a giant memcached cluster.

Where scalable key-value stores like BigTable come in is when none of the above can work, and you have so much data of a single type that even when it's denormalized a single table is too much for one server. At this point you need to be able to partition it arbitrarily and still have a clean API to access it. Naturally when the data is spread out across so many machines you can't have algorithms that require these machines to talk to each other much, which many of the standard relational algorithms would require. As you suggest, these distributed querying algorithms have the potential to require more total processing power than the equivalent JOIN in a properly indexed relational database, but because they are parallelized the real time performance is orders of magnitude better than any single machine could do (assuming a machine that could hold the entire index even exists).
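To make the partitioning idea concrete, here is a minimal sketch of an application-side "join" over a hash-partitioned key-value store. Everything in it is an assumption for illustration: `ShardedStore`, `multi_get` and `posts_with_authors` are hypothetical names, and the "servers" are just in-memory dicts, not the API of BigTable or any real store.

```python
import hashlib
from collections import defaultdict


class ShardedStore:
    """Toy key-value store split across N in-memory 'servers' (plain dicts)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so the same key always lands on the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def multi_get(self, keys):
        # Group keys by shard so each shard is contacted once per batch.
        by_shard = defaultdict(list)
        for key in keys:
            by_shard[self._shard_for(key)].append(key)
        found = {}
        for shard_id, shard_keys in by_shard.items():
            shard = self.shards[shard_id]
            for key in shard_keys:
                if key in shard:
                    found[key] = shard[key]
        return found


def posts_with_authors(posts, users, post_ids):
    """Application-side 'join': fetch posts, then batch-fetch their authors."""
    post_rows = posts.multi_get(post_ids)
    author_ids = {row["author_id"] for row in post_rows.values()}
    author_rows = users.multi_get(author_ids)
    return [
        {"title": post_rows[pid]["title"],
         "author": author_rows[post_rows[pid]["author_id"]]["name"]}
        for pid in post_ids if pid in post_rows
    ]


users = ShardedStore(num_shards=4)
posts = ShardedStore(num_shards=4)
users.put("u1", {"name": "alice"})
users.put("u2", {"name": "bob"})
posts.put("p1", {"title": "Hello", "author_id": "u1"})
posts.put("p2", {"title": "Sharding", "author_id": "u2"})
posts.put("p3", {"title": "Again", "author_id": "u1"})

result = posts_with_authors(posts, users, ["p1", "p2", "p3"])
print(result)
```

The point of batching in `multi_get` is that the number of round trips grows with the number of shards touched, not with the number of rows, and in a real system the per-shard requests could be issued in parallel.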

Now once you can scale your massive data set horizontally (by just plugging in more servers), the hard part of scalability is done. Well I shouldn't say done, because ongoing operations and development at this scale are a lot harder than the single-server app, but the point is application servers are typically trivial to scale via a share-nothing architecture as long as they can get the data they need in a timely fashion.

To answer your question about how commonly used ORMs handle the inability to use JOINs, the short answer is they don't. ORM stands for Object Relational Mapping, and most of the job of an ORM is just translating the powerful relational paradigm of predicate logic into simple object-oriented data structures. Most of the value of what they give you is simply not going to be possible from a key-value store. In practice you will probably need to build up and maintain your own data-access layer that's suited to your particular needs, because data profiles at these scales are going to vary dramatically and I believe there are too many tradeoffs for a general purpose tool to emerge and become dominant the way RDBMSs have. In short, you'll always have to do more legwork at this scale.
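As a rough illustration of what such a hand-rolled data-access layer might look like, the sketch below denormalizes on write and maintains its own index, so reads need no join. `CommentRepository`, its methods and the key naming scheme are all hypothetical, and `store` is just a dict standing in for whatever key-value client is actually in use.

```python
class CommentRepository:
    """Hypothetical hand-rolled data-access layer over a key-value store.

    `store` is any dict-like object; a real deployment would wrap a
    client for whichever store is in use.
    """

    def __init__(self, store):
        self.store = store

    def add_comment(self, comment_id, post_id, author, text):
        # Denormalize on write: copy the author's display name into the
        # comment so reads never need a second lookup (no join).
        comment = {"id": comment_id, "author_name": author["name"], "text": text}
        self.store["comment:" + comment_id] = comment
        # Maintain a hand-built index: the list of comment ids per post.
        index_key = "post_comments:" + post_id
        self.store[index_key] = self.store.get(index_key, []) + [comment_id]

    def comments_for_post(self, post_id):
        ids = self.store.get("post_comments:" + post_id, [])
        return [self.store["comment:" + cid] for cid in ids]


repo = CommentRepository(store={})
repo.add_comment("c1", "p1", {"name": "alice"}, "First!")
repo.add_comment("c2", "p1", {"name": "bob"}, "Nice post.")
names = [c["author_name"] for c in repo.comments_for_post("p1")]
print(names)
```

The tradeoff is the legwork mentioned above: if a user changes their name, every copied `author_name` has to be rewritten by your own code, because there is no relational engine to do it for you.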

That said, it will definitely be interesting to see what kind of relational or other aggregate functionality can be built on top of the key-value store primitives. I don't really have enough experience here to comment specifically, but there is a lot of knowledge in enterprise computing about this going back many years (e.g. Oracle), a lot of untapped theoretical knowledge in academia, and a lot of practical knowledge at Google, Amazon, Facebook, et al. The knowledge that has filtered out into the wider development community, though, is still fairly limited.

However now that a lot of applications are moving to the web, and more and more of the world's population is online, inevitably more and more applications will have to scale, and best practices will begin to crystallize. The knowledge gap will be whittled down from both sides by cloud services like AppEngine and EC2, as well as open source databases like Cassandra. In some sense this goes hand in hand with parallel and asynchronous computation which is also in its infancy. Definitely a fascinating time to be a programmer.

gtd answered Sep 16 '22 11:09


You're starting from a faulty assumption.

Data warehousing does not normalize data the same way that a transaction application normalizes. There are not "lots" of joins. There are relatively few.

In particular, second and third Normal Form violations are not a "problem", since data warehouses are rarely updated. And when they are updated, it's generally only a status flag change to mark a dimension row as "current" vs. "not current".

Since you don't have to worry about updates, you don't decompose things down to the 2NF level where an update can't lead to anomalous relationships. No updates means no anomalies; and no decomposition and no joins. You can pre-join everything.

Generally, DW data is decomposed according to a star schema. This guides you to decompose the data into the numeric "fact" tables that contain the measures -- numbers with units -- and foreign key references to the dimensions.

A dimension (or "business entity") is best thought of as a real-world thing with attributes. Often, this includes things like geography, time, product, customer, etc. These things often have complex hierarchies. The hierarchies are usually arbitrary, defined by various business reporting needs, and not modeled as separate tables, but simply columns in the dimension used for aggregation.
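A minimal sketch of a star schema, using Python's built-in sqlite3 module, may help here. The table and column names (`fact_sales`, `dim_store`, `region`, and so on) and the sample rows are made up for illustration. Note that the city/region hierarchy is just columns on the dimension, and the report needs only one join from the fact table to the dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (
        store_key INTEGER PRIMARY KEY,
        store_name TEXT,
        city TEXT,      -- hierarchy levels are plain columns,
        region TEXT     -- not separate normalized tables
    );
    CREATE TABLE fact_sales (
        store_key INTEGER REFERENCES dim_store(store_key),
        sale_date TEXT,
        amount_usd REAL -- the measure: a number with a unit
    );
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?, ?)", [
    (1, "Downtown", "Boston", "East"),
    (2, "Harbor", "Boston", "East"),
    (3, "Mission", "San Francisco", "West"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    (1, "2009-10-01", 100.0),
    (2, "2009-10-01", 50.0),
    (3, "2009-10-01", 70.0),
    (1, "2009-10-02", 30.0),
])

# Aggregate the measure along one level of the dimension's hierarchy.
totals = conn.execute("""
    SELECT d.region, SUM(f.amount_usd)
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_key = f.store_key
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(totals)
```

Rolling up by `city` instead of `region` is the same query with a different column, which is the sense in which the hierarchy lives in the dimension rather than in extra tables.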


To address some of your questions.

"this joining has to be done on the application side of things". Kind of. The data is "pre-joined" prior to being loaded. The dimension data is often a join of relevant source data about that dimension. It's joined and loaded as a relatively flat structure.

It isn't updated. Instead of updates, additional historical records are inserted.
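As a sketch of "insert instead of update", the snippet below keeps every version of a customer row and only flips a status flag on the old one (this pattern is often called a type 2 slowly changing dimension). The `dim_customer` table, its columns and the `record_change` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        row_key INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id TEXT,
        city TEXT,
        is_current INTEGER
    )
""")


def record_change(conn, customer_id, city):
    """Keep history: flag the old row as not current, then insert a new one."""
    with conn:  # one transaction for both statements
        conn.execute(
            "UPDATE dim_customer SET is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,),
        )
        conn.execute(
            "INSERT INTO dim_customer (customer_id, city, is_current) "
            "VALUES (?, ?, 1)",
            (customer_id, city),
        )


record_change(conn, "c42", "Boston")
record_change(conn, "c42", "Chicago")
history = conn.execute(
    "SELECT city, is_current FROM dim_customer "
    "WHERE customer_id = ? ORDER BY row_key",
    ("c42",),
).fetchall()
print(history)
```

The only in-place change is the status flag; the descriptive attributes of the old row are never overwritten, so facts loaded earlier still point at the values that were true at the time.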

"but doesn't that start to get expensive?". Kind of. It takes some care to get the data loaded. However, there aren't a lot of reporting/analysis joins. The data is pre-joined.

The ORM issues are largely moot since the data is pre-joined. Your ORM maps to the fact or dimension as appropriate. Except in special cases, dimensions tend to be small-ish and fit entirely in memory. The exception is when you're in Finance (Banking or Insurance) or Public Utilities and have massive customer databases. These customer dimensions rarely fit in memory.

S.Lott answered Sep 18 '22 11:09