I have two massive tables with about 100 million records each and I'm afraid I needed to perform an Inner Join between the two. Now, both tables are very simple; here's the description:
BioEntity table:
EGM table (an auxiliar table, in fact, resulting of bulk import operations):
I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join but the query appeared to be taking way too long and the logfile of the database (in simple recovery mode) managed to chew up all the available disk space (that's just over 200 GB, when the database occupies 18GB) and the query would fail after waiting for two days, If I'm not mistaken. I managed to keep the log from growing (only 33 MB now) but the query has been running non-stop for 6 days now and it doesn't look like it's gonna stop anytime soon.
I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.
Now, here's the query:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX FROM EGM INNER JOIN BioEntity ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
I had manually setup some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and it did not end either, so I tried running Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId which I find rather odd - or just plain dumb - but I gave it a go anyway).
It has been running for 6 days now and I'm still not sure what to do... Any ideas guys? How can I make this faster (or, at least, finite)?
Update: - Ok, I've canceled the query and rebooted the server to get the OS up and running again - I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is gonna take at least two hours, so I'll be posting further updates later on
Update 2 (1PM GMT time, 18/11/09): - The estimated execution plan reveals a 67% cost regarding table scans followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan but this particular fact just lifted my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be making manual indexes and discard the crappy suggested ones.
Relational algebra is the most common way of writing a query and also the most natural way to do so. The code is clean, easy to troubleshoot, and unsurprisingly, it is also the most efficient way to join two tables.
In this case the two tables are joined using the relationship table1.id = table2.id . It is possible to use multiple join statements together to join more than one table at the same time. To do that you add a second INNER JOIN statement and a second ON statement to indicate the third table and the second relationship.
I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.
You could try adding an integer column to each table and computing a hash on the NAME field that should get the possible matches to a reasonable number before the engine has to look at the actual VARCHAR data.
For huge joins, sometimes explicitly choosing a loop join
speeds things up:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX FROM EGM INNER LOOP JOIN BioEntity ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
As always, posting your estimated execution plan could help us provide better answers.
EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX FROM EGM INNER JOIN BioEntity ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId OPTION (MERGE JOIN)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With