How can I achieve a cross join in R? I know that merge can do inner and outer joins, but I do not know how to do a cross join. Thanks.

How to do cross join in R?

3 Answers

If speed is an issue, I suggest checking out the excellent data.table package. In the benchmark at the end, the data.table versions are roughly 70x to 90x faster than merge (by median time).

You didn't provide example data. If you just want all combinations of two or more individual columns, you can use CJ (cross join):

library(data.table)
CJ(x=1:2,y=letters[1:3])
#   x y
#1: 1 a
#2: 1 b
#3: 1 c
#4: 2 a
#5: 2 b
#6: 2 c
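
As a small sketch of the "or more" case: CJ accepts any number of vectors, and the row count is the product of their lengths. The column name z and its values are made up for illustration. By default CJ sorts each input and keys the result.

```r
library(data.table)

# Three input vectors: 2 * 3 * 2 = 12 combinations.
# z and its values are illustrative only.
grid <- CJ(x = 1:2, y = letters[1:3], z = c("lo", "hi"))

nrow(grid)   # product of the input lengths
names(grid)  # one column per input vector
```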

If you want to do a cross join on two tables, I haven't found a way to use CJ(). But you can still use data.table:

x2<-data.table(id1=letters[1:3],vals1=1:3)
y2<-data.table(id2=letters[4:7],vals2=4:7)

res<-setkey(x2[,c(k=1,.SD)],k)[y2[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
res
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

Explanation of the res line:

Add a dummy column (k in this example) to one table and set it as the key (setkey(tablename,keycolumns)), add the same dummy column to the other table, and then join them.
An X[Y] join matches X's key columns against the leading columns of Y by position, not by name, so the dummy column goes first. The c(k=1,.SD) part is one way that I have found to add a column at the beginning (the default is to add it at the end).
A standard data.table join has a format of X[Y]. The X in this case is setkey(x2[,c(k=1,.SD)],k), and the Y is y2[,c(k=1,.SD)].
allow.cartesian=TRUE permits a join whose result has more than nrow(X)+nrow(Y) rows, which is what happens here because every row shares the same key value (prior versions didn't require this).
The [,k:=NULL] at the end just removes the dummy key from the result.
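
The steps above can also be written one at a time, which may be easier to follow than the one-liner. This is only an unrolled sketch of the same approach; the intermediate names X and Y are introduced here for illustration, and x2 and y2 are rebuilt so the block runs on its own.

```r
library(data.table)

x2 <- data.table(id1 = letters[1:3], vals1 = 1:3)
y2 <- data.table(id2 = letters[4:7], vals2 = 4:7)

# Step 1: copy each table with a constant dummy column k placed first.
X <- x2[, c(k = 1, .SD)]
Y <- y2[, c(k = 1, .SD)]

# Step 2: key X on the dummy column.
setkey(X, k)

# Step 3: X[Y] join. Every row of X matches every row of Y because k is
# always 1; allow.cartesian = TRUE permits the larger-than-usual result.
res <- X[Y, allow.cartesian = TRUE]

# Step 4: drop the dummy column by reference.
res[, k := NULL]

dim(res)
```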

You can also turn this into a function, so it's cleaner to use:

# Version 1; easier to write:
CJ.table.1 <- function(X,Y)
  setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]

CJ.table.1(x2,y2)
#    id1 vals1 id2 vals2
# 1:   a     1   d     4
# 2:   b     2   d     4
# 3:   c     3   d     4
# 4:   a     1   e     5
# 5:   b     2   e     5
# 6:   c     3   e     5
# 7:   a     1   f     6
# 8:   b     2   f     6
# 9:   c     3   f     6
#10:   a     1   g     7
#11:   b     2   g     7
#12:   c     3   g     7

# Version 2; faster but messier:
CJ.table.2 <- function(X,Y) {
  eval(parse(text=paste0("setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],list(",paste0(unique(c(names(X),names(Y))),collapse=","),")][,k:=NULL]")))
}

Here are some speed benchmarks:

# Create a bigger (but still very small) example:
n<-1e3
x3<-data.table(id1=1L:n,vals1=sample(letters,n,replace=T))
y3<-data.table(id2=1L:n,vals2=sample(LETTERS,n,replace=T))

library(microbenchmark)
microbenchmark(merge=merge.data.frame(x3,y3,all=TRUE),
               CJ.table.1=CJ.table.1(x3,y3),
               CJ.table.2=CJ.table.2(x3,y3),
               times=3, unit="s")
#Unit: seconds
#       expr        min         lq     median         uq        max neval
#      merge 4.03710225 4.23233688 4.42757152 5.57854711 6.72952271     3
# CJ.table.1 0.06227603 0.06264222 0.06300842 0.06701880 0.07102917     3
# CJ.table.2 0.04740142 0.04812997 0.04885853 0.05433146 0.05980440     3

Note that these data.table methods are much faster than the merge method suggested by @danas.zuokas. The two tables with 1,000 rows in this example result in a cross-joined table with 1 million rows. So even if your original tables are small, the result can get big quickly and speed becomes important.
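
To make the growth concrete, here is a quick base R sketch of the arithmetic. The row counts are arbitrary illustrations, and the two tables are assumed to be the same size.

```r
# Rows in a cross join = rows in the first table * rows in the second.
n <- c(10, 100, 1e3, 1e4)
sizes <- data.frame(rows_per_table = n, cross_join_rows = n * n)
sizes
```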

Lastly, recent versions of data.table require you to either add allow.cartesian=TRUE (as in CJ.table.1) or specify the names of the columns that should be returned (as in CJ.table.2). The second method seems to be faster, but it needs more complicated code if you want to specify all the column names automatically, and it may not work with duplicate column names. (Feel free to suggest a simpler version of CJ.table.2.)
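
One possible simpler variant, offered only as a sketch and not benchmarked against the versions above: use CJ on the row indices of the two tables and bind the selected rows side by side. The helper name cross_join_idx is hypothetical and not part of data.table. The rows come out in a different order than CJ.table.1 (the first table's rows vary slowest).

```r
library(data.table)

# Hypothetical helper (not part of data.table): cross join two tables by
# taking all combinations of their row indices.
cross_join_idx <- function(X, Y) {
  idx <- CJ(i = seq_len(nrow(X)), j = seq_len(nrow(Y)))
  cbind(X[idx$i], Y[idx$j])
}

x2 <- data.table(id1 = letters[1:3], vals1 = 1:3)
y2 <- data.table(id2 = letters[4:7], vals2 = 4:7)

res <- cross_join_idx(x2, y2)
dim(res)
```

This avoids both the dummy key and eval(parse()), at the cost of materialising the index table first.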

answered Nov 07 '22 14:11

dnlbrky

Is it just all=TRUE?

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).
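
The all=TRUE call above gives a Cartesian product because x and y share no column names, so the default by (the common column names) is empty. A minimal sketch, assuming a hypothetical y_shared table that does share a column name with x, of how by = NULL asks for the cross join explicitly:

```r
x <- data.frame(id1 = c("a", "b", "c"), vals1 = 1:3)
y <- data.frame(id2 = c("d", "e", "f"), vals2 = 4:6)

# Explicit cross join: by = NULL has length 0, so merge() returns the
# Cartesian product regardless of column names.
res <- merge(x, y, by = NULL)
dim(res)

# y_shared is a made-up table that shares the column name id1 with x.
y_shared <- data.frame(id1 = c("a", "b"), vals2 = 4:5)

joined  <- merge(x, y_shared)             # joins on id1 instead
crossed <- merge(x, y_shared, by = NULL)  # still a cross join
nrow(joined)
nrow(crossed)
```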

answered Nov 07 '22 14:11

danas.zuokas

This was asked years ago, but you can use tidyr::crossing() to do a cross-join. Definitely the simplest solution of the bunch.

library(tidyr)

league <- c("MLB", "NHL", "NFL", "NBA")
season <- c("2018", "2017")

tidyr::crossing(league, season)
#> # A tibble: 8 x 2
#>   league season
#>   <chr>  <chr> 
#> 1 MLB    2017  
#> 2 MLB    2018  
#> 3 NBA    2017  
#> 4 NBA    2018  
#> 5 NFL    2017  
#> 6 NFL    2018  
#> 7 NHL    2017  
#> 8 NHL    2018

Created on 2018-12-08 by the reprex package (v0.2.0).
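
One caveat worth knowing: crossing() de-duplicates and sorts its inputs, which is why the output above is ordered alphabetically. If repeated values should be kept, tidyr::expand_grid() does the same expansion without that step. A short sketch with made-up vectors:

```r
library(tidyr)

# x contains a repeated value; the data are illustrative only.
deduped <- crossing(x = c(2, 1, 1), y = c("a", "b"))
full    <- expand_grid(x = c(2, 1, 1), y = c("a", "b"))

nrow(deduped)  # duplicates in x are dropped before expanding
nrow(full)     # every input element is kept
```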

answered Nov 07 '22 15:11

Evan O.

Tags: r, cross-join

zjffdu
