I have a data set that looks something like this:
id1 id2 size
1 5400 5505 7
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
7 5505 5400 11
Here id1 and id2 are unique nodes in a graph, and size is a value assigned to the directed edge connecting them from id1 to id2. This data set is fairly large (a little over 2 million rows). What I would like to do is sum the size column, grouped by unordered node pairs of id1 and id2. For example, in the first row we have id1=5400 and id2=5505. There is another row in the data frame where id1=5505 and id2=5400. In the grouped data, the size values of these two rows would be summed into a single row. In other words, I want to summarize the data by grouping on the (unordered) set of (id1, id2). I've found a way to do this using apply with a custom function that checks for the reversed column pair in the full data set, but it is excruciatingly slow. Does anyone know of another way to do this, perhaps with plyr or with something in the base packages, that would be more efficient?
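For anyone who wants to reproduce the problem, the sample rows can be rebuilt as a data frame. This is only a sketch: the name DF is an assumption (it matches the name used in the answers), and the values are copied from the table above.

```r
# Sample data from the question; DF is an assumed name
DF <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Rows 1 and 7 describe the same unordered pair {5400, 5505}
same_pair <- (DF$id1 == 5400 & DF$id2 == 5505) |
             (DF$id1 == 5505 & DF$id2 == 5400)
DF[same_pair, ]
```

The desired result would merge those two rows into one, with their size values added together.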
Grouping can also be done on multiple columns of the data frame; for this, just pass the names of those columns to the function.
To pick out single or multiple columns, use the select() function. The select() function expects a data frame as its first input ('argument', in R language), followed by the names of the columns you want to extract, separated by commas.
R has a sort function that you can use to sort your data in either ascending or descending order. The variable by which you sort can be a numeric, string or factor variable. You also have some options for how missing values are handled: they can be listed first, last or removed.
An alternative method:
R> library(igraph)
R> DF
id1 id2 size
1 5400 5505 7
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
7 5505 5400 11
R> g <- graph.data.frame(DF, directed=FALSE)
R> g <- simplify(g, edge.attr.comb="sum", remove.loops=FALSE)
R> DF <- get.data.frame(g)
R> DF
id1 id2 size
1 5400 5505 18
2 5033 5458 1
3 5452 2873 24
4 5452 5213 2
5 5452 4242 26
6 4823 4823 4
One way is to create extra columns with pmax and pmin of id1 and id2, as follows. I'll use a data.table solution here.
library(data.table)
DT <- data.table(DF)
# Following mnel's suggestion, the pmin/pmax keys are computed directly in `by`,
# and naming them `id1` and `id2` keeps the output column names unchanged
DT.OUT <- DT[, list(size = sum(size)),
             by = list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
DT.OUT
#     id1  id2 size
# 1: 5400 5505   18
# 2: 5033 5458    1
# 3: 2873 5452   24
# 4: 5213 5452    2
# 5: 4242 5452   26
# 6: 4823 4823    4
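Since the question also asks about the base packages, the same pmin/pmax idea can probably be expressed with aggregate. The following is a minimal sketch, not a benchmarked solution: the column names lo and hi are made up for illustration, and the sample data is rebuilt so the block runs on its own.

```r
# Sample data from the question
DF <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Build an order-independent key: smaller id in lo, larger id in hi
# (lo and hi are hypothetical names chosen for this sketch)
DF$lo <- pmin(DF$id1, DF$id2)
DF$hi <- pmax(DF$id1, DF$id2)

# Sum size within each unordered pair
out <- aggregate(size ~ lo + hi, data = DF, FUN = sum)
out
```

Both pmin and pmax are vectorized, so the key is built without looping over rows. On a couple of million rows the data.table version is likely to be faster than aggregate, but that is an expectation rather than something measured here.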