How can I achieve a cross join in R ? I know that "merge" can do inner join, outer join. But I do not know how to achieve a cross join in R.
Thanks
What is a Cross Join In R? An cross join is a merge operation between two data frame which seeks to return every matching records in the two data frames, including ones that lack a matching pair on the other side. Cross joins have a bad reputation among developers.
Merge data frames in R. The R merge function allows merging two data frames by common columns or by row names. This function allows you to perform different database (SQL) joins, like left join, inner join, right join or full join, among others.
In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package. The most important condition for joining two dataframes is that the column type should be the same on which the merging happens. merge() function works similarly like join in DBMS.
If speed is an issue, I suggest checking out the excellent data.table
package. In the example at the end it's ~90x faster than merge
.
You didn't provide example data. If you just want to get all combinations of two (or more individual) columns, you can use CJ
(cross join):
library(data.table) CJ(x=1:2,y=letters[1:3]) # x y #1: 1 a #2: 1 b #3: 1 c #4: 2 a #5: 2 b #6: 2 c
If you want to do a cross join on two tables, I haven't found a way to use CJ(). But you can still use data.table
:
x2<-data.table(id1=letters[1:3],vals1=1:3) y2<-data.table(id2=letters[4:7],vals2=4:7) res<-setkey(x2[,c(k=1,.SD)],k)[y2[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL] res # id1 vals1 id2 vals2 # 1: a 1 d 4 # 2: b 2 d 4 # 3: c 3 d 4 # 4: a 1 e 5 # 5: b 2 e 5 # 6: c 3 e 5 # 7: a 1 f 6 # 8: b 2 f 6 # 9: c 3 f 6 #10: a 1 g 7 #11: b 2 g 7 #12: c 3 g 7
Explanation of the res
line:
setkey(tablename,keycolumns)
), add the dummy column to the other table, and then join them.c(k=1,.SD)
part is one way that I have found to add columns at the beginning (the default is to add them to the end).X[Y]
. The X in this case is setkey(x2[,c(k=1,.SD)],k)
, and the Y is y2[,c(k=1,.SD)]
.allow.cartesian=TRUE
tells data.table
to ignore the duplicate key values, and perform a cartesian join (prior versions didn't require this)[,k:=NULL]
at the end just removes the dummy key from the result.You can also turn this into a function, so it's cleaner to use:
# Version 1; easier to write: CJ.table.1 <- function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL] CJ.table.1(x2,y2) # id1 vals1 id2 vals2 # 1: a 1 d 4 # 2: b 2 d 4 # 3: c 3 d 4 # 4: a 1 e 5 # 5: b 2 e 5 # 6: c 3 e 5 # 7: a 1 f 6 # 8: b 2 f 6 # 9: c 3 f 6 #10: a 1 g 7 #11: b 2 g 7 #12: c 3 g 7 # Version 2; faster but messier: CJ.table.2 <- function(X,Y) { eval(parse(text=paste0("setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],list(",paste0(unique(c(names(X),names(Y))),collapse=","),")][,k:=NULL]"))) }
Here are some speed benchmarks:
# Create a bigger (but still very small) example: n<-1e3 x3<-data.table(id1=1L:n,vals1=sample(letters,n,replace=T)) y3<-data.table(id2=1L:n,vals2=sample(LETTERS,n,replace=T)) library(microbenchmark) microbenchmark(merge=merge.data.frame(x3,y3,all=TRUE), CJ.table.1=CJ.table.1(x3,y3), CJ.table.2=CJ.table.2(x3,y3), times=3, unit="s") #Unit: seconds # expr min lq median uq max neval # merge 4.03710225 4.23233688 4.42757152 5.57854711 6.72952271 3 # CJ.table.1 0.06227603 0.06264222 0.06300842 0.06701880 0.07102917 3 # CJ.table.2 0.04740142 0.04812997 0.04885853 0.05433146 0.05980440 3
Note that these data.table
methods are much faster than the merge
method suggested by @danas.zuokas. The two tables with 1,000 rows in this example result in a cross-joined table with 1 million rows. So even if your original tables are small, the result can get big quickly and speed becomes important.
Lastly, recent versions of data.table
require you to add the allow.cartesian=TRUE
(as in CJ.table.1) or specify the names of the columns that should be returned (CJ.table.2). The second method (CJ.table.2) seems to be faster, but requires some more complicated code if you want to automatically specify all the column names. And it may not work with duplicate column names. (Feel free to suggest a simpler version of CJ.table.2)
Is it just all=TRUE
?
x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)
From documentation of merge
:
If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).
This was asked years ago, but you can use tidyr::crossing()
to do a cross-join. Definitely the simplest solution of the bunch.
library(tidyr)
league <- c("MLB", "NHL", "NFL", "NBA")
season <- c("2018", "2017")
tidyr::crossing(league, season)
#> # A tibble: 8 x 2
#> league season
#> <chr> <chr>
#> 1 MLB 2017
#> 2 MLB 2018
#> 3 NBA 2017
#> 4 NBA 2018
#> 5 NFL 2017
#> 6 NFL 2018
#> 7 NHL 2017
#> 8 NHL 2018
Created on 2018-12-08 by the reprex package (v0.2.0).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With