How to self join a data.table on a condition

Q: Can you natural join a table with itself?

A natural join is shorthand for joining two tables (or subqueries) on all columns that have the same name. A natural join of a table to itself therefore matches rows on every column. The most common result is the table itself, provided none of the values are NULL and the rows are unique. Rows containing a NULL drop out, because NULL never compares equal to NULL, and duplicated rows are multiplied.

Q: What is the joining condition?

A join condition defines the relationship between a physical schema entity object and itself (self-join) or the relationship between two entity objects: a Child table and a Parent table. A join can have one or more conditions. A join condition is synonymous with the ON clause in a SQL Join statement.

Q: What is a self join, when would you use one, and what is an example?

You use a self join when a table references data in itself. E.g., an Employee table may have a SupervisorID column that points to the employee that is the boss of the current employee.

Q: What are the best scenarios to use a self join?

The classic real-world case is a table of employee data in which each row holds information about an employee and that employee's manager. A self join lets you retrieve the manager's details alongside each employee.

Tags: join, r, data.table

I want to add a new column to my data.table. This column should contain the sum of another column of all rows that satisfy a certain condition. An example: My data.table looks like this:

require(data.table)
DT <- data.table(n=c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
             t=c(10, 20, 33, 40, 50, 22, 25, 34, 11),
             v=c(20, 15, 16, 17, 11, 12, 20, 22, 10)
             )
DT
   n  t  v
1: a 10 20
2: a 20 15
3: a 33 16
4: a 40 17
5: a 50 11
6: a 22 12
7: b 25 20
8: b 34 22
9: b 11 10

For every row x, taken over all rows i with the same n where abs(t[i] - t[x]) <= 10, I want to calculate

foo = sum( v[i] * abs(t[i] - t[x]) )

In SQL I would solve this using a self join. In R I was able to do this using a for loop:

for (i in 1:nrow(DT))
    DT[i, foo:=DT[n==DT[i]$n & abs(t-DT[i]$t)<=10, sum(v * abs(t-DT[i]$t) )]]

DT
   n  t  v foo
1: a 10 20 150
2: a 20 15 224
3: a 33 16 119
4: a 40 17 222
5: a 50 11 170
6: a 22 12  30
7: b 25 20 198
8: b 34 22 180
9: b 11 10   0

Unfortunately I have to do this quite often and the table I work with is rather large. The for-loop approach works but is too slow. I played around with the sqldf package, with no real breakthrough. I would love to do this using some data.table magic, and that is where I need your help :-). I think what is needed is some kind of self join on the condition that the difference of the t values is smaller than the threshold.
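The self join described above can be sketched in data.table as an equi-join on n followed by a filter on the t difference. This is only a minimal illustration, not a tuned solution: it builds every within-group pair before filtering, so memory grows with the square of the group size. The id column and the names pairs and res are introduced here for the sketch and are not part of the original data.

```r
library(data.table)

DT <- data.table(n = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                 t = c(10, 20, 33, 40, 50, 22, 25, 34, 11),
                 v = c(20, 15, 16, 17, 11, 12, 20, 22, 10))

# Row id (assumed helper column) so results can be mapped back to the original rows.
DT[, id := .I]

# Self join on n: every row is paired with every row of the same group.
# allow.cartesian = TRUE is needed because the result is larger than both inputs combined.
pairs <- merge(DT, DT, by = "n", suffixes = c(".x", ".i"), allow.cartesian = TRUE)

# Keep only pairs within the threshold, then aggregate per "x" row.
res <- pairs[abs(t.i - t.x) <= 10,
             .(foo = sum(v.i * abs(t.i - t.x))),
             keyby = .(id = id.x)]

# Update join: write foo back onto DT by id.
DT[res, on = "id", foo := i.foo]
```

Every row pairs with itself at distance 0, so each id appears in res and no row is left without a value.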

Follow-up: In my application this join is done over and over again. The v's change, but the t's and the n's are always the same. So I am thinking about somehow storing which rows belong together. Any ideas how to do this in a clever way?
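One way to store which rows belong together, sketched under the assumption that n and t really are fixed: compute the list of row pairs and their weights abs(t[i] - t[x]) once, then reuse it for each new v. The names idx, pairs, target, nbr, w and the helper foo_for are all hypothetical, chosen for this example.

```r
library(data.table)

DT <- data.table(n = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                 t = c(10, 20, 33, 40, 50, 22, 25, 34, 11),
                 v = c(20, 15, 16, 17, 11, 12, 20, 22, 10))

# One-off step: depends only on n and t, never on v.
idx <- DT[, .(n, t, id = .I)]
pairs <- merge(idx, idx, by = "n", suffixes = c(".x", ".i"), allow.cartesian = TRUE)
# target = row receiving the sum, nbr = contributing row, w = abs(t[i] - t[x]).
pairs <- pairs[abs(t.i - t.x) <= 10,
               .(target = id.x, nbr = id.i, w = abs(t.i - t.x))]

# Repeated step: a grouped weighted sum for any v given in DT's row order.
# Each row pairs with itself (w = 0), so every target is present in the result.
foo_for <- function(v) {
  pairs[, .(foo = sum(v[nbr] * w)), keyby = target]$foo
}

DT[, foo := foo_for(v)]
```

When v changes, only foo_for(new_v) needs to be called again; the pair table is left untouched.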

asked Feb 20 '13 15:02

uuazed

1 Answer

Great question. This answer is just a taster really alongside Ricardo's answer.

Ideally we want to avoid the large cartesian self join for efficiency. Unfortunately range joins (FR#203) haven't been implemented yet. In the meantime, using the very latest v1.8.7 (untested):

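setkey(DT,n,t)   # sorts DT by n, then t
# from: first row in the same n with t >= t-10 (next observation rolled backward if no exact match)
DT[,from:=DT[.(n,t-10),which=TRUE,roll=-Inf,rollends=TRUE]]
# to: last row in the same n with t <= t+10 (last observation rolled forward if no exact match)
DT[,to:=DT[.(n,t+10),which=TRUE,roll=+Inf,rollends=TRUE]]
DT[,foo:=0]      # numeric rather than 0L, because the sums assigned below are doubles
for (i in seq_len(nrow(DT))) {
    s = seq.int(DT$from[i],DT$to[i])
    set(DT, i, "foo", DT[,sum(v[s]*abs(t[s]-t[i]))] )
}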

Once FR#203 is done, the logic above would be built in, and it should become a one-liner:

setkey(DT,n,t)
DT[.(n,.(t-10,t+10),t), foo:=sum(v*abs(t-i.t))]

The second column of the i table there is a 2-column column (indicating a between join). That should be fast because, as usual, j would be evaluated for each row of i without needing to create a huge cartesian self join table.

That's the current thinking, anyway.
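For readers on a newer data.table: range joins did later ship as non-equi joins (v1.9.8 onward), with an on= syntax that differs from the one proposed above. A minimal sketch under that assumption follows. The lo and hi columns are helper names made up for this example, and the code has not been benchmarked on large inputs.

```r
library(data.table)

DT <- data.table(n = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
                 t = c(10, 20, 33, 40, 50, 22, 25, 34, 11),
                 v = c(20, 15, 16, 17, 11, 12, 20, 22, 10))

# Window bounds as explicit columns (assumed helper names).
DT[, `:=`(lo = t - 10, hi = t + 10)]

# For each row of i (the inner DT), match rows of x with the same n whose t lies in [lo, hi].
# by = .EACHI evaluates j once per row of i, so no full pair table is materialised.
# x.t and x.v are the matched rows' values; i.t is the current row's t.
res <- DT[DT, on = .(n, t >= lo, t <= hi),
          .(foo = sum(x.v * abs(x.t - i.t))),
          by = .EACHI]

# res has one row per row of i, in i's order, so it lines up with DT.
DT[, foo := res$foo]
DT[, c("lo", "hi") := NULL]
```

The x. prefix matters here: in a non-equi join the unprefixed join columns in the result carry i's bound values, so x.t is the explicit way to reach the matched row's own t.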

answered Oct 21 '22 09:10

Matt Dowle
