Aggregate a data frame based on unordered pairs of columns

Tags: r, aggregate, plyr

I have a data set that looks something like this:

     id1  id2   size
1   5400 5505      7
2   5033 5458      1
3   5452 2873     24
4   5452 5213      2
5   5452 4242     26
6   4823 4823      4
7   5505 5400     11

Here id1 and id2 are node identifiers in a graph, and size is a value assigned to the directed edge running from id1 to id2. The data set is fairly large (a little over 2 million rows).

What I would like to do is sum the size column, grouped by unordered node pairs of id1 and id2. For example, the first row has id1=5400 and id2=5505, and another row in the data frame has id1=5505 and id2=5400. In the grouped data, the sizes of these two rows would be added together into a single row. In other words, I want to summarize the data by grouping on the (unordered) set {id1, id2}.

I've found a way to do this using apply with a custom function that checks for the reversed column pair in the full data set, but it is excruciatingly slow. Does anyone know of a more efficient way to do this, perhaps with plyr or with something in the base packages?

asked Mar 18 '13 21:03

R_User

2 Answers

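An alternative method, using igraph: treat the rows as edges of an undirected graph and let simplify() merge duplicate edges by summing their size attribute.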

R> library(igraph)
R> DF
   id1  id2 size
1 5400 5505    7
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
7 5505 5400   11
R> g  <- graph.data.frame(DF, directed=FALSE)
R> g  <- simplify(g, edge.attr.comb="sum", remove.loops=FALSE)
R> DF <- get.data.frame(g)
R> DF
   id1  id2 size
1 5400 5505   18
2 5033 5458    1
3 5452 2873   24
4 5452 5213    2
5 5452 4242   26
6 4823 4823    4
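
For readers on a recent igraph release, here is a hedged sketch of the same approach using the newer function names graph_from_data_frame() and as_data_frame(), which replace the dotted names above. It assumes igraph is installed; the sample data is rebuilt from the question, and the variable name `out` is only illustrative. The edge endpoints are typically returned in columns named from and to, as vertex names rather than numbers.

```r
library(igraph)

# Sample data from the question
DF <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Build an undirected graph so that 5400-5505 and 5505-5400 are the same edge
g <- graph_from_data_frame(DF, directed = FALSE)

# Merge multi-edges, summing their size attribute; keep the 4823-4823 self-loop
g <- igraph::simplify(g, remove.multiple = TRUE, remove.loops = FALSE,
                      edge.attr.comb = "sum")

# Back to a data frame of edges with the aggregated size column
out <- igraph::as_data_frame(g, what = "edges")
```

Here `remove.loops = FALSE` matters: without it the self-loop row (4823, 4823) would be dropped from the result.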

answered Oct 08 '22 03:10

margaret

One way is to group by the pmin and pmax of id1 and id2, so that each unordered pair maps to the same (smaller, larger) key. I'll use a data.table solution here.

require(data.table)
DT <- data.table(DF)
# Following mnel's suggestion, the pmin/pmax grouping columns are computed
# directly in `by` and named `id1` and `id2`, so no helper columns are needed
DT.OUT <- DT[, list(size=sum(size)), 
        by=list(id1 = pmin(id1, id2), id2 = pmax(id1, id2))]
#     id1  id2 size
# 1: 5400 5505   18
# 2: 5033 5458    1
# 3: 2873 5452   24
# 4: 5213 5452    2
# 5: 4242 5452   26
# 6: 4823 4823    4
# Note: id1 now holds the smaller and id2 the larger id of each pair
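
Since the question also asks about the base packages, the same pmin/pmax idea can be sketched without data.table. This is only a minimal illustration, not a benchmarked solution: the names `canon` and `out` are made up for the example, and aggregate() may well be noticeably slower than data.table on a couple of million rows.

```r
# Sample data from the question
DF <- data.frame(
  id1  = c(5400, 5033, 5452, 5452, 5452, 4823, 5505),
  id2  = c(5505, 5458, 2873, 5213, 4242, 4823, 5400),
  size = c(7, 1, 24, 2, 26, 4, 11)
)

# Canonical form of each unordered pair: smaller id first, larger id second,
# so (5400, 5505) and (5505, 5400) end up with identical keys
canon <- data.frame(
  id1  = pmin(DF$id1, DF$id2),
  id2  = pmax(DF$id1, DF$id2),
  size = DF$size
)

# Sum size within each canonical pair
out <- aggregate(size ~ id1 + id2, data = canon, FUN = sum)

# Look up the combined pair from the question
out$size[out$id1 == 5400 & out$id2 == 5505]
```

Because pmin() and pmax() are vectorised, the canonical keys are built in one pass over the columns, which avoids the row-by-row search that makes the apply approach slow.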

answered Oct 08 '22 03:10

Arun
