Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL: Inner joining two massive tables

I have two massive tables with about 100 million records each and I'm afraid I needed to perform an Inner Join between the two. Now, both tables are very simple; here's the description:

BioEntity table:

  • BioEntityId (int)
  • Name (nvarchar 4000, although this is an overkill)
  • TypeId (int)

EGM table (an auxiliar table, in fact, resulting of bulk import operations):

  • EMGId (int)
  • PId (int)
  • Name (nvarchar 4000, although this is an overkill)
  • TypeId (int)
  • LastModified (date)

I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join but the query appeared to be taking way too long and the logfile of the database (in simple recovery mode) managed to chew up all the available disk space (that's just over 200 GB, when the database occupies 18GB) and the query would fail after waiting for two days, If I'm not mistaken. I managed to keep the log from growing (only 33 MB now) but the query has been running non-stop for 6 days now and it doesn't look like it's gonna stop anytime soon.

I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.

Now, here's the query:

 SELECT EGM.Name, BioEntity.BioEntityId INTO AUX  FROM EGM INNER JOIN BioEntity   ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId 

I had manually setup some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and it did not end either, so I tried running Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId which I find rather odd - or just plain dumb - but I gave it a go anyway).

It has been running for 6 days now and I'm still not sure what to do... Any ideas guys? How can I make this faster (or, at least, finite)?

Update: - Ok, I've canceled the query and rebooted the server to get the OS up and running again - I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is gonna take at least two hours, so I'll be posting further updates later on

Update 2 (1PM GMT time, 18/11/09): - The estimated execution plan reveals a 67% cost regarding table scans followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan but this particular fact just lifted my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be making manual indexes and discard the crappy suggested ones.

like image 880
João Pereira Avatar asked Nov 17 '09 16:11

João Pereira


People also ask

What is the most efficient way of joining 2 table in same database?

Relational algebra is the most common way of writing a query and also the most natural way to do so. The code is clean, easy to troubleshoot, and unsurprisingly, it is also the most efficient way to join two tables.

Can you inner join multiple tables in SQL?

In this case the two tables are joined using the relationship table1.id = table2.id . It is possible to use multiple join statements together to join more than one table at the same time. To do that you add a second INNER JOIN statement and a second ON statement to indicate the third table and the second relationship.


2 Answers

I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.

You could try adding an integer column to each table and computing a hash on the NAME field that should get the possible matches to a reasonable number before the engine has to look at the actual VARCHAR data.

like image 200
Larry Lustig Avatar answered Oct 13 '22 20:10

Larry Lustig


For huge joins, sometimes explicitly choosing a loop join speeds things up:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX FROM EGM  INNER LOOP JOIN BioEntity      ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId 

As always, posting your estimated execution plan could help us provide better answers.

EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX FROM EGM  INNER JOIN BioEntity      ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId OPTION (MERGE JOIN) 
like image 21
Andomar Avatar answered Oct 13 '22 20:10

Andomar