I have an R dataframe:
# here just define it directly, but it comes from a simulation
simPrice <- data.frame(simId=c(1,1,2,2),
                       crop=rep(c('apple','pear'),2),
                       mean=rep(c(10,22),2),
                       sd=rep(c(2,4),2),
                       price=c(9,21,12,18))
  simId  crop mean sd price
1     1 apple   10  2     9
2     1  pear   22  4    21
3     2 apple   10  2    12
4     2  pear   22  4    18
This is the price of fruit (apples and pears) in two different iterations of a simulation. In general, I may have any number of fruit or iterations. Crucially, I may also have other columns (e.g. varieties, date sold, location sold, etc).
I have another dataframe giving the volume of fruit grown at a number of farms:
# here just define it directly, but it comes from a simulation
simVol <- data.frame(simId=c(1,1,1,1,2,2,2,2),
                     farm=rep(c('farm A', 'farm A', 'farm B', 'farm B'),2),
                     crop=rep(c('apple','pear'),4),
                     mean=rep(c(10,22),4),
                     sd=rep(c(2,4),4),
                     volume=c(9,21,12,18,10,22,11,19))
  simId   farm  crop mean sd volume
1     1 farm A apple   10  2      9
2     1 farm A  pear   22  4     21
3     1 farm B apple   10  2     12
4     1 farm B  pear   22  4     18
5     2 farm A apple   10  2     10
6     2 farm A  pear   22  4     22
7     2 farm B apple   10  2     11
8     2 farm B  pear   22  4     19
Now I want to multiply these together.
I assume that to do this, I first have to "broadcast" simPrice over farms so that the two dataframes have exactly the same order.
My solution is this:
broadcast <- function(origDf, broadcast_dimList) {
  newDimDf <- do.call(expand.grid, broadcast_dimList)
  nReps <- nrow(newDimDf)
  # replicate each row of the original dataframe in place
  # (index by position: sorting character row names would misorder 10 or more rows)
  result <- origDf[rep(seq_len(nrow(origDf)), each = nReps), ]
  # add the new dimensions, recycled across the replicated rows
  result <- cbind(newDimDf, result)
  # rename rows sequentially
  row.names(result) <- NULL
  return(result)
}
bcastSimPrice <- broadcast(simPrice, list(farm=c('farm A','farm B')))
    farm simId  crop mean sd price
1 farm A     1 apple   10  2     9
2 farm B     1 apple   10  2     9
3 farm A     1  pear   22  4    21
4 farm B     1  pear   22  4    21
5 farm A     2 apple   10  2    12
6 farm B     2 apple   10  2    12
7 farm A     2  pear   22  4    18
8 farm B     2  pear   22  4    18
This works, but it leaves me with the problem of now trying to match up the rows of bcastSimPrice (farms incrementing before crops) with the rows of simVol (the other way around).
Is there another way to approach this problem?
Thanks!
Here's a solution with dplyr. First we set up the data (I assumed that including sd and mean in your volume data was an error):
simPrice <- data.frame(
  simId = c(1, 1, 2, 2),
  crop = rep(c('apple', 'pear'), 2),
  mean = rep(c(10, 22), 2),
  sd = rep(c(2, 4), 2),
  price = c(9, 21, 12, 18),
  stringsAsFactors = FALSE
)
simVol <- data.frame(
  simId = c(1, 1, 1, 1, 2, 2, 2, 2),
  farm = rep(c('farm A', 'farm A', 'farm B', 'farm B'), 2),
  crop = rep(c('apple', 'pear'), 4),
  volume = c(9, 21, 12, 18, 10, 22, 11, 19),
  stringsAsFactors = FALSE
)
Next we join the two datasets together (join is a slightly more common description for this task than merge). Here I'm using left_join(), which always preserves all rows of the left table. Because the join matches rows on simId and crop, the row order of the two tables does not matter, so no broadcasting step is needed. mutate() adds new columns, and %>% strings the operations together (older dplyr releases used %.%, which has since been removed).
library(dplyr)
rev <- simPrice %>%
  left_join(simVol, by = c("simId", "crop")) %>%
  mutate(revenue = volume * price)
rev
You can also group and aggregate:
rev %>%
  group_by(simId, crop, farm) %>%
  summarise(revenue = sum(revenue))
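With the sample data, each (simId, crop, farm) combination has exactly one row, so grouping on all three columns returns the joined rows unchanged. As an illustration of an aggregation that actually collapses rows, here is a minimal sketch that totals revenue per simulation and farm across crops. It assumes a current dplyr with the %>% pipe; the name farm_totals is made up for this example, and the data is redefined (without mean and sd) so the block runs on its own.

```r
library(dplyr)

# Same sample data as in the answer, redefined so this block is self-contained
simPrice <- data.frame(
  simId = c(1, 1, 2, 2),
  crop  = rep(c("apple", "pear"), 2),
  price = c(9, 21, 12, 18),
  stringsAsFactors = FALSE
)
simVol <- data.frame(
  simId  = c(1, 1, 1, 1, 2, 2, 2, 2),
  farm   = rep(c("farm A", "farm A", "farm B", "farm B"), 2),
  crop   = rep(c("apple", "pear"), 4),
  volume = c(9, 21, 12, 18, 10, 22, 11, 19),
  stringsAsFactors = FALSE
)

# Join on the shared keys, compute revenue, then sum over crops
# within each simulation and farm (farm_totals is a hypothetical name)
farm_totals <- simPrice %>%
  left_join(simVol, by = c("simId", "crop")) %>%
  mutate(revenue = volume * price) %>%
  group_by(simId, farm) %>%
  summarise(revenue = sum(revenue), .groups = "drop")

farm_totals
```

Any extra columns in either table (variety, date sold, and so on) would simply be carried through the join; if such a column exists in both tables and identifies a row, it would need to be added to the by argument.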
You might find dplyr useful because it names the most common data analysis operations. The introductory vignette gives more details.
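If dplyr is not an option, the same key-based matching can be sketched in base R with merge(). This is only an illustration under the assumption that both tables share the simId and crop columns; the name rev_base is made up here, and the data is redefined so the block runs on its own.

```r
# Same sample data as in the answer, redefined so this block is self-contained
simPrice <- data.frame(
  simId = c(1, 1, 2, 2),
  crop  = rep(c("apple", "pear"), 2),
  mean  = rep(c(10, 22), 2),
  sd    = rep(c(2, 4), 2),
  price = c(9, 21, 12, 18),
  stringsAsFactors = FALSE
)
simVol <- data.frame(
  simId  = c(1, 1, 1, 1, 2, 2, 2, 2),
  farm   = rep(c("farm A", "farm A", "farm B", "farm B"), 2),
  crop   = rep(c("apple", "pear"), 4),
  volume = c(9, 21, 12, 18, 10, 22, 11, 19),
  stringsAsFactors = FALSE
)

# merge() pairs rows by the key columns, so the row order of the inputs is irrelevant
rev_base <- merge(simVol, simPrice, by = c("simId", "crop"))
rev_base$revenue <- rev_base$volume * rev_base$price

# Sort for readability only; the join itself does not rely on this
rev_base <- rev_base[order(rev_base$simId, rev_base$farm, rev_base$crop), ]
rev_base
```

Each price row is repeated for every farm that grows that crop in that simulation, which is the effect the hand-written broadcast() function was aiming for.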