In an R dataframe, how do I broadcast columns corresponding to dimensions?

I have an R dataframe:

# here just define it directly, but it comes from a simulation
simPrice <- data.frame(simId=c(1,1,2,2), 
                       crop=rep(c('apple','pear'),2), 
                       mean=rep(c(10,22),2), 
                       sd=rep(c(2,4),2), 
                       price=c(9,21,12,18))

    simId   crop mean sd price
  1     1  apple   10  2     9
  2     1   pear   22  4    21
  3     2  apple   10  2    12
  4     2   pear   22  4    18

This is the price of fruit (apples and pears) in two different iterations of a simulation. In general, I may have any number of fruit or iterations. Crucially, I may also have other columns (e.g. varieties, date sold, location sold, etc).

I have another dataframe giving the volume of fruit grown at a number of farms:

# here just define it directly, but it comes from a simulation
simVol  <- data.frame(simId=c(1,1,1,1,2,2,2,2), 
                      farm=rep(c('farm A', 'farm A', 'farm B', 'farm B'),2),
                      crop=rep(c('apple','pear'),4), 
                      mean=rep(c(10,22),4), 
                      sd=rep(c(2,4),4), 
                      volume=c(9,21,12,18,10,22,11,19))

  simId   farm  crop mean sd volume
1     1 farm A apple   10  2      9
2     1 farm A  pear   22  4     21
3     1 farm B apple   10  2     12
4     1 farm B  pear   22  4     18
5     2 farm A apple   10  2     10
6     2 farm A  pear   22  4     22
7     2 farm B apple   10  2     11
8     2 farm B  pear   22  4     19

Now I want to multiply these together.

I assume that to do this, I have to first "broadcast" simPrice over farms so that the two dataframes have exactly the same order.

My solution is this:

broadcast <- function(origDf, broadcast_dimList) {
    # all combinations of the new dimension values
    newDimDf <- do.call(expand.grid, broadcast_dimList)
    nReps <- nrow(newDimDf)
    # replicate each line of the original dataframe in place; indexing by
    # position avoids sorting row names as text ("10" would sort before "2")
    result <- origDf[rep(seq_len(nrow(origDf)), each = nReps), , drop = FALSE]
    # add the new dimensions, recycled for each row of the original
    result <- cbind(newDimDf, result)
    # rename rows sequentially
    row.names(result) <- NULL
    result
}

bcastSimPrice <- broadcast(simPrice, list(farm=c('farm A','farm B')))

    farm simId  crop mean sd price
1 farm A     1 apple   10  2     9
2 farm B     1 apple   10  2     9
3 farm A     1  pear   22  4    21
4 farm B     1  pear   22  4    21
5 farm A     2 apple   10  2    12
6 farm B     2 apple   10  2    12
7 farm A     2  pear   22  4    18
8 farm B     2  pear   22  4    18

This works, but it leaves me with the problem of now trying to match up the rows of bcastSimPrice (farms incrementing before crops) with the rows of simVol (the other way around).

Is there another way to approach this problem?

Thanks!

Racing Tadpole asked Feb 05 '14 10:02



1 Answer

Here's a solution with dplyr. First we set up the data (I assumed that including sd and mean in your volume data was an error).

simPrice <- data.frame(
  simId = c(1, 1, 2, 2),  
  crop = rep(c('apple', 'pear'), 2),  
  mean = rep(c(10, 22), 2),  
  sd = rep(c(2, 4), 2),  
  price = c(9, 21, 12, 18),
  stringsAsFactors = FALSE
)

simVol  <- data.frame(
  simId = c(1, 1, 1, 1, 2, 2, 2, 2),  
  farm = rep(c('farm A', 'farm A', 'farm B', 'farm B'), 2), 
  crop = rep(c('apple', 'pear'), 4),  
  volume = c(9, 21, 12, 18, 10, 22, 11, 19),
  stringsAsFactors = FALSE
)

Next we join the two datasets together (join is a slightly more common description for this task than merge). Here I'm using left_join(), which always preserves all rows on the left. Because the join matches rows on the key columns simId and crop, each price is repeated for every farm automatically, and the row order of the two data frames does not matter. mutate() adds new columns, and %>% strings the operations together.

library(dplyr)

rev <- simPrice %>% 
  left_join(simVol, by = c("simId", "crop")) %>%
  mutate(revenue = volume * price)
rev

You can also group and aggregate:

rev %>%
  group_by(simId, crop, farm) %>%
  summarise(revenue = sum(revenue))

You might find dplyr useful because it names the most common data analysis operations. The introductory vignette gives more details.
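For comparison, the same idea can be sketched in base R with merge() and aggregate(). This is only an illustration, not part of the dplyr approach: the names revBase and revByFarm are made up for this example, and the data is the toy data from the setup step.

```r
# Toy data, repeated here so the sketch runs on its own.
simPrice <- data.frame(
  simId = c(1, 1, 2, 2),
  crop = rep(c("apple", "pear"), 2),
  mean = rep(c(10, 22), 2),
  sd = rep(c(2, 4), 2),
  price = c(9, 21, 12, 18),
  stringsAsFactors = FALSE
)
simVol <- data.frame(
  simId = c(1, 1, 1, 1, 2, 2, 2, 2),
  farm = rep(c("farm A", "farm A", "farm B", "farm B"), 2),
  crop = rep(c("apple", "pear"), 4),
  volume = c(9, 21, 12, 18, 10, 22, 11, 19),
  stringsAsFactors = FALSE
)

# merge() matches rows on the shared key columns, so each price is
# repeated for every farm without manual broadcasting or reordering.
# (revBase is a hypothetical name chosen for this sketch.)
revBase <- merge(simVol, simPrice, by = c("simId", "crop"))
revBase$revenue <- revBase$volume * revBase$price

# Total revenue per simulation run and farm.
revByFarm <- aggregate(revenue ~ simId + farm, data = revBase, FUN = sum)
print(revByFarm)
```

Either way, the key point is that the rows are aligned by the values in simId and crop and not by position, so any extra columns in either data frame are simply carried along.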

hadley answered Sep 30 '22 20:09