SQL: Inner joining two massive tables

Tags: sql, inner-join, sql-server, sql-server-2008, query-optimization

I have two massive tables with about 100 million records each, and I'm afraid I need to perform an inner join between the two. Both tables are very simple; here's the description:

BioEntity table:

BioEntityId (int)
Name (nvarchar 4000, although this is overkill)
TypeId (int)

EGM table (an auxiliary table, in fact, resulting from bulk import operations):

EMGId (int)
PId (int)
Name (nvarchar 4000, although this is overkill)
TypeId (int)
LastModified (date)

I need to match on Name in order to associate each BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query took far too long and the database log file (in simple recovery mode) chewed up all the available disk space (just over 200 GB, while the database itself occupies 18 GB). If I'm not mistaken, the query failed after running for two days. I have since managed to keep the log from growing (it's only 33 MB now), but the query has been running non-stop for 6 days and doesn't look like it's going to finish anytime soon.

I'm running it on a fairly decent computer (4 GB RAM, Core 2 Duo (E8400) 3 GHz, Windows Server 2008, SQL Server 2008), and I've noticed that the computer freezes for a couple of seconds roughly every 30 seconds. This makes it quite hard to use for anything else, which is really getting on my nerves.

Now, here's the query:

    SELECT EGM.Name, BioEntity.BioEntityId
    INTO AUX
    FROM EGM
    INNER JOIN BioEntity
        ON EGM.name LIKE BioEntity.Name
        AND EGM.TypeId = BioEntity.TypeId

I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days without finishing, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, containing just the TypeId, which I find rather odd, or just plain dumb, but I gave it a go anyway).

It has been running for 6 days now and I'm still not sure what to do. Any ideas, guys? How can I make this faster (or, at least, finite)?

Update:
- OK, I've canceled the query and rebooted the server to get the OS up and running again.
- I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is going to take at least two hours, so I'll post further updates later on.
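For reference, the rewritten query would look roughly like the sketch below. It assumes that an exact match on Name is what's wanted, i.e. that the names contain no LIKE wildcards meant to be interpreted as patterns. With LIKE, only the TypeId comparison is an equality, so a hash or merge join can key on TypeId alone and has to test the name pattern on every pair of rows sharing a TypeId. With "=", both columns can serve as join keys.

```sql
-- Sketch only: same tables and columns as the original query,
-- with the LIKE comparison replaced by an equality comparison.
SELECT EGM.Name, BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name = BioEntity.Name
    AND EGM.TypeId = BioEntity.TypeId;
```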

Update 2 (1 PM GMT, 18/11/09): The estimated execution plan attributes 67% of the cost to table scans, followed by 33% to a hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan, but this particular fact raised my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be creating indexes manually and discarding the crappy suggested ones.

asked Nov 17 '09 16:11

João Pereira

2 Answers

I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.

You could try adding an integer column to each table and computing a hash of the Name field in it. Joining on the hash first should cut the possible matches down to a reasonable number before the engine has to look at the actual VARCHAR data.
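Here is a rough, untested sketch of that idea in T-SQL. The column name NameHash and the index names are made up for illustration, and it assumes exact ("=") matching on Name is acceptable. CHECKSUM is just one built-in way to get an int hash; different names can produce the same value, so the final comparison on Name stays in the join condition.

```sql
-- Hypothetical names: NameHash, IX_BioEntity_TypeId_NameHash, IX_EGM_TypeId_NameHash.
-- GO is the batch separator used by SSMS/sqlcmd; the new columns must exist
-- before a later batch can reference them.
ALTER TABLE BioEntity ADD NameHash AS CHECKSUM(Name) PERSISTED;
ALTER TABLE EGM ADD NameHash AS CHECKSUM(Name) PERSISTED;
GO

-- Narrow int-only index keys instead of keys containing the wide nvarchar column.
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId_NameHash
    ON BioEntity (TypeId, NameHash) INCLUDE (BioEntityId);
CREATE NONCLUSTERED INDEX IX_EGM_TypeId_NameHash
    ON EGM (TypeId, NameHash);
GO

-- Join on the cheap integer columns first; the Name comparison
-- only weeds out hash collisions among the remaining candidates.
SELECT EGM.Name, BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
    ON EGM.TypeId = BioEntity.TypeId
    AND EGM.NameHash = BioEntity.NameHash
    AND EGM.Name = BioEntity.Name;
```

Adding persisted columns and building indexes on 100 million rows takes time and log space of its own, so it's probably worth trying on a smaller copy of the tables first.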

answered Oct 13 '22 20:10

Larry Lustig

For huge joins, sometimes explicitly choosing a loop join speeds things up:

    SELECT EGM.Name, BioEntity.BioEntityId
    INTO AUX
    FROM EGM
    INNER LOOP JOIN BioEntity
        ON EGM.name LIKE BioEntity.Name
        AND EGM.TypeId = BioEntity.TypeId

As always, posting your estimated execution plan could help us provide better answers.
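If you'd rather not use the SSMS toolbar button, one way to get the estimated plan is SET SHOWPLAN_XML, sketched below with the query from the question. While the setting is on, SQL Server returns the plan instead of executing the statement, so AUX is not created and nothing runs for days.

```sql
-- SET SHOWPLAN_XML must be the only statement in its batch,
-- hence the GO separators (SSMS/sqlcmd).
SET SHOWPLAN_XML ON;
GO

-- Not executed while SHOWPLAN_XML is ON; only the estimated plan is returned.
SELECT EGM.Name, BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name LIKE BioEntity.Name
    AND EGM.TypeId = BioEntity.TypeId;
GO

SET SHOWPLAN_XML OFF;
GO
```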

EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:

    SELECT EGM.Name, BioEntity.BioEntityId
    INTO AUX
    FROM EGM
    INNER JOIN BioEntity
        ON EGM.name LIKE BioEntity.Name
        AND EGM.TypeId = BioEntity.TypeId
    OPTION (MERGE JOIN)

answered Oct 13 '22 20:10

Andomar
