SQL Alternative to performing an INNER JOIN on a single table

Tags: performance, sql, inner-join, mysql

I have a large table, TokenFrequency, with millions of rows. It is structured like this:

Table - TokenFrequency

id - int, primary key
source - int, foreign key
token - char
count - int

My goal is to find every pair of sources that share the same token. For example, if my table looked like this:

id   source   token   count
1    1        dog     1
2    2        cat     2
3    3        cat     2
4    4        pig     5
5    5        zoo     1
6    5        cat     1
7    5        pig     1

I want a SQL query that returns the two sources, the token they share, and the sum of their counts. For example:

source1   source2   token   count
2         3         cat     4
2         5         cat     3
3         5         cat     3
4         5         pig     6

I have a query that looks like this:

SELECT  F.source AS source1, S.source AS source2, F.token, 
       (F.count + S.count) AS sum 
FROM       TokenFrequency F 
INNER JOIN TokenFrequency S ON F.token = S.token 
WHERE F.source <> S.source

This query works, but I have two problems with it:

1. The TokenFrequency table has millions of rows, so I need a faster way to obtain this result.
2. The query returns duplicates. For example, it selects both:
   source1=2, source2=3, token=cat, count=4
   source1=3, source2=2, token=cat, count=4
   This isn't much of a problem in itself, but if there is a way to eliminate them and gain speed in the process, that would be very useful.

The main issue is speed: the current query takes hours to complete. I believe the INNER JOIN of the table to itself is the problem. I'm sure there must be a way to eliminate the inner join and get similar results using only one instance of the TokenFrequency table. Fixing the second problem above might also speed up the query.

I need a way to restructure this query to provide the same results in a faster, more efficient manner.

Thanks.


asked Aug 07 '09 20:08

cruzja

2 Answers

I'd need a little more info to diagnose the speed issue, but to remove the duplicates, add this to the WHERE clause (it also makes the existing F.source <> S.source condition redundant):

AND F.source<S.source
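
Putting that condition into the query from the question gives the sketch below. The table definition and sample rows are reconstructed from the question; the CHAR(32) length is an assumption, and the foreign key on source is left out so the example stands alone. The ORDER BY is only there to make the sample output stable and can be dropped on the real table.

```sql
-- Sample table reconstructed from the question (CHAR length is assumed,
-- foreign key on source omitted).
CREATE TABLE TokenFrequency (
    id      INT PRIMARY KEY,
    source  INT NOT NULL,
    token   CHAR(32) NOT NULL,
    `count` INT NOT NULL
);

INSERT INTO TokenFrequency (id, source, token, `count`) VALUES
    (1, 1, 'dog', 1),
    (2, 2, 'cat', 2),
    (3, 3, 'cat', 2),
    (4, 4, 'pig', 5),
    (5, 5, 'zoo', 1),
    (6, 5, 'cat', 1),
    (7, 5, 'pig', 1);

-- Keep each pair once: F.source < S.source already implies F.source <> S.source.
SELECT F.source AS source1,
       S.source AS source2,
       F.token,
       F.`count` + S.`count` AS `sum`
FROM       TokenFrequency F
INNER JOIN TokenFrequency S ON F.token = S.token
WHERE F.source < S.source
ORDER BY source1, source2;
```

On the sample rows this returns the four pairs listed in the question: (2, 3, cat, 4), (2, 5, cat, 3), (3, 5, cat, 3) and (4, 5, pig, 6).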


answered Sep 28 '22 07:09

KM.

Try this:

SELECT token, GROUP_CONCAT(source), SUM(count)
FROM TokenFrequency
GROUP BY token;

This should run a lot faster and also eliminate the duplicates. But the sources will be returned in a comma-separated list, so you'll have to explode that in your application.

You might also try creating a compound index over the columns token, source, count (in that order) and analyze with EXPLAIN to see if MySQL is smart enough to use it as a covering index for this query.
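
A minimal sketch of that suggestion, assuming MySQL; the index name idx_token_source_count is made up for illustration. When the index is used as a covering index, the Extra column of the EXPLAIN output typically shows "Using index".

```sql
-- Compound index in the order token, source, count (index name is hypothetical).
CREATE INDEX idx_token_source_count
    ON TokenFrequency (token, source, `count`);

-- Inspect the plan; check the key and Extra columns of the output.
EXPLAIN
SELECT token, GROUP_CONCAT(source), SUM(`count`)
FROM TokenFrequency
GROUP BY token;
```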

update: I seem to have misunderstood your question. You don't want the sum of counts per token, you want the sum of counts for every pair of sources for a given token.
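
To make the difference concrete, here is roughly what the GROUP BY query returns on the sample rows from the question. The two ORDER BY clauses and the column aliases are additions for a stable, readable listing, and the result in the comment was worked out by hand from those seven rows.

```sql
SELECT token,
       GROUP_CONCAT(source ORDER BY source) AS sources,
       SUM(`count`) AS total
FROM TokenFrequency
GROUP BY token
ORDER BY token;

-- token | sources | total
-- cat   | 2,3,5   | 5
-- dog   | 1       | 1
-- pig   | 4,5     | 6
-- zoo   | 5       | 1
```

The cat row carries a single total of 5, whereas the question asks for three separate pair sums (4, 3 and 3), which is why this query does not answer it.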

I believe the inner join is the best solution for this. An important guideline for SQL is that if you need to calculate an expression with respect to two different rows, then you need to do a join.

However, one optimization technique that I mentioned above is to use a covering index so that all the columns you need are included in an index data structure. The benefit is that all your lookups are O(log n), and the query doesn't need to do a second I/O to read the physical row to get other columns.

In this case, you should create the covering index over columns token, source, count as I mentioned above. Also try to allocate enough cache space so that the index can be cached in memory.
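
As a rough sketch of how to check this in MySQL: the question does not state the storage engine, so both cache variables are shown, and the reported sizes are approximate for some engines. The EXPLAIN uses the self-join with the F.source < S.source condition from the other answer.

```sql
-- Assumes the (token, source, count) index already exists.
-- The Engine column tells you which cache applies; Index_length is the
-- approximate index size in bytes that the cache would need to hold.
SHOW TABLE STATUS LIKE 'TokenFrequency';

-- MyISAM caches index blocks in the key buffer; InnoDB uses the buffer pool.
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Check whether the self-join is answered from the index alone
-- ("Using index" in the Extra column for both table aliases).
EXPLAIN
SELECT F.source, S.source, F.token, F.`count` + S.`count`
FROM       TokenFrequency F
INNER JOIN TokenFrequency S ON F.token = S.token
WHERE F.source < S.source;
```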

answered Sep 28 '22 06:09

Bill Karwin
