calculate average over multiple data frames

Tags:

r

I would like to use R to plot performance evaluation results for several DB systems. For each system I loaded the same data and executed the same queries over several iterations.

The data for a single system looks like this:

"iteration", "lines", "loadTime", "query1", "query2", "query3"
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2

2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5

3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2

What I need now (for plotting) is a matrix or data frame containing the average of these measurements.

At the moment I am doing this:

# read the file
all_results <- read.csv(file="file.csv", header=TRUE, sep=",")

# split the results by iteration
results <- split(all_results, all_results$iteration)

# convert each result into a data frame
r1 = as.data.frame(results[1])
r2 = as.data.frame(results[2])
r3 = as.data.frame(results[3])

# calculate the average
(r1 + r2 +r3) / 3

I could put all this into a function and calculate the average matrix in a for loop, but I have the vague feeling that there must be a more elegant solution. Any ideas?

What can I do when I have incomplete results, e.g., when one iteration has fewer rows than the others?

Thanks!


asked Jan 19 '11 16:01

2 Answers

If I understand you correctly, on a given DB system, in each "iteration" (1...N) you are loading a sequence of DataSets (1,2,3) and running queries on them. It seems like at the end you want to calculate the average time across all iterations, for each DataSet. If so, you actually need to have an additional column DataSet in your all_results table that identifies the DataSet. We can add this column as follows:

all_results <- cbind( data.frame( DataSet = rep(1:3,3) ), all_results )
> all_results
  DataSet iteration  lines loadTime query1 query2 query3
1       1         1 100000    120.4    0.5    6.4    1.2
2       2         1 100000    110.1    0.1    5.2    2.1
3       3         1  50000    130.3    0.2    4.3    2.2
4       1         2 100000    120.4    0.1    2.4    1.2
5       2         2 100000    300.2    0.2    4.5    1.4
6       3         2  50000    235.3    0.4    4.2    0.5
7       1         3 100000    233.5    0.7    8.3    6.7
8       2         3 100000    300.1    0.9    0.5    4.4
9       3         3  50000    100.2    0.4    9.2    1.2

Now you can use the ddply function from the plyr package to easily extract the averages for the load and query times for each DataSet.

> library(plyr)
> ddply(all_results, .(DataSet), colwise(mean, .(loadTime, query1, query2)))
  DataSet loadTime    query1 query2
1       1 158.1000 0.4333333    5.7
2       2 236.8000 0.4000000    3.4
3       3 155.2667 0.3333333    5.9

Incidentally, I highly recommend you look at Hadley Wickham's plyr package for a rich set of data-manipulation functions.
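If some iterations are incomplete, a hard-coded rep(1:3,3) no longer lines up with the rows. A minimal base-R sketch of one way around that, assuming rows appear in the same data-set order within every iteration and only trailing rows are ever missing (both are assumptions, not something stated in the question): number the rows within each iteration with ave(), then average per DataSet with aggregate(). The dropped row in the sample data is hypothetical.

```r
# Sample data from the question, with the last row of iteration 3 removed
# to simulate an incomplete run (hypothetical).
all_results <- data.frame(
  iteration = c(1, 1, 1, 2, 2, 2, 3, 3),
  lines     = c(100000, 100000, 50000, 100000, 100000, 50000, 100000, 100000),
  loadTime  = c(120.4, 110.1, 130.3, 120.4, 300.2, 235.3, 233.5, 300.1),
  query1    = c(0.5, 0.1, 0.2, 0.1, 0.2, 0.4, 0.7, 0.9),
  query2    = c(6.4, 5.2, 4.3, 2.4, 4.5, 4.2, 8.3, 0.5),
  query3    = c(1.2, 2.1, 2.2, 1.2, 1.4, 0.5, 6.7, 4.4)
)

# Number the rows within each iteration: 1, 2, 3, 1, 2, 3, 1, 2.
# Assumption: same data-set order in every iteration, only trailing rows missing.
all_results$DataSet <- ave(seq_along(all_results$iteration),
                           all_results$iteration, FUN = seq_along)

# Average the timing columns per DataSet; each group uses whatever rows exist.
avg <- aggregate(cbind(loadTime, query1, query2, query3) ~ DataSet,
                 data = all_results, FUN = mean)
print(avg)
```

Here DataSet 3 is averaged over two iterations instead of three, so its means are based on fewer measurements than the others.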


answered Oct 09 '22 00:10

Prasad Chalasani

I don't see why you need to split all_results by iteration. You can just use aggregate on all_results. There's no need for all iterations to have the same number of observations.

Lines <- "iteration, lines, loadTime, query1, query2, query3
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2
2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5
3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2"

all_results <- read.csv(textConnection(Lines))

aggregate(all_results[,-1], by=all_results[,"iteration",drop=FALSE], mean)
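To illustrate the point about unequal group sizes, here is a small sketch using the formula interface of aggregate(), which should be equivalent to the call above for this data. The input is the data from the question with the last row of iteration 3 removed; that missing row is a made-up case, not part of the original results.

```r
# Data from the question, minus the last row of iteration 3
# (a hypothetical incomplete run).
Lines <- "iteration, lines, loadTime, query1, query2, query3
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2
2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5
3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4"

all_results <- read.csv(text = Lines)

# Average every other column within each iteration.
# Iteration 3 has only two rows, which aggregate() handles without complaint.
per_iteration <- aggregate(. ~ iteration, data = all_results, FUN = mean)
print(per_iteration)
```

Note that the lines column is averaged along with the timings; drop it first if that value is not meaningful for the plot.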

answered Oct 09 '22 01:10

Joshua Ulrich

behas
