I have a data frame of 205,000+ rows formatted as follows:
df <- data.frame(project.id = c('SP001', 'SP001', 'SP001', 'SP017', 'SP018', 'SP017'),
supplier.id = c('1224', '5542', '7741', '1224', '2020', '9122'))
In the actual data frame there are 6700+ unique values of project.id
. I would like to create an edge list that pairs suppliers who have worked on the same project.
Desired end result for project.id = SP001
:
to from
1224 5542
1224 7741
5542 7741
So far I've tried using split
to create a list by project.id and then running lapply+combn
to generate all possible combinations of supplier.id
within each list/group:
try.list <- split(df, df$project.id)
try.output <- lapply(try.list, function(x) combn(x$supplier.id, 2))
Is there a more elegant/efficient (read "computed in less than 2hrs") way to generate something like this?
Any help would be much appreciated
Instead of using split
and lapply
, you can use the dplyr
package.
df <- data.frame(project.id = c('SP001', 'SP001', 'SP001', 'SP017', 'SP018', 'SP017'),
supplier.id = c('1224', '5542', '7741', '1224', '2020', '9122'),
stringsAsFactors = FALSE)
library(dplyr)
df %>% group_by(project.id) %>%
filter(n()>=2) %>% group_by(project.id) %>%
do(data.frame(t(combn(.$supplier.id, 2)), stringsAsFactors=FALSE))
# Source: local data frame [4 x 3]
# Groups: project.id [2]
# project.id X1 X2
# (chr) (chr) (chr)
# 1 SP001 1224 5542
# 2 SP001 1224 7741
# 3 SP001 5542 7741
# 4 SP017 1224 9122
You can just merge it with itself which gets you all the Cartesian pairs:
temp <- merge(df,df, by="project.id")
res <- temp[ temp$supplier.id.x != temp$supplier.id.y , ]
> res
project.id supplier.id.x supplier.id.y
2 SP001 1224 5542
3 SP001 1224 7741
4 SP001 5542 1224
6 SP001 5542 7741
7 SP001 7741 1224
8 SP001 7741 5542
11 SP017 1224 9122
12 SP017 9122 1224
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With