 

calculate average over multiple data frames

Tags:

r

I would like to use R to plot performance evaluation results for distinct DB systems. For each system I load the same data and execute the same queries over several iterations.

The data for a single system looks like this:

"iteration", "lines", "loadTime", "query1", "query2", "query3"
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2

2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5

3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2

What I need now (for plotting) is a matrix or data frame containing the average of these measurements.

At the moment I am doing this:

# read the file
all_results <- read.csv(file = "file.csv", header = TRUE, sep = ",")

# split the results by iteration
results <- split(all_results, all_results$iteration)

# convert each result into a data frame
r1 <- as.data.frame(results[[1]])
r2 <- as.data.frame(results[[2]])
r3 <- as.data.frame(results[[3]])

# calculate the average
(r1 + r2 + r3) / 3

I could put all this into a function and calculate the average matrix in a for loop, but I have the vague feeling that there must be a more elegant solution. Any ideas?
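For what it's worth, the loop I have in mind could be collapsed into a single Reduce call over the split list; a minimal sketch, assuming every iteration has the same number of rows in the same order:

# element-wise sum of all per-iteration data frames, divided by their count;
# equivalent to (r1 + r2 + r3) / 3 but works for any number of iterations
avg <- Reduce(`+`, results) / length(results)
avg$iteration <- NULL  # the averaged iteration column is meaningless, drop it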

What can I do for cases where I have incomplete results, e.g., when one iteration has fewer rows than the others?

Thanks!

asked Jan 19 '11 by behas



2 Answers

If I understand you correctly, on a given DB system, in each "iteration" (1...N) you load a sequence of DataSets (1, 2, 3) and run queries on them. It seems that at the end you want the average time across all iterations for each DataSet. If so, you need an additional column, DataSet, in your all_results table that identifies the DataSet. We can add it as follows:

all_results <- cbind( data.frame( DataSet = rep(1:3,3) ), all_results )
> all_results
  DataSet iteration  lines loadTime query1 query2 query3
1       1         1 100000    120.4    0.5    6.4    1.2
2       2         1 100000    110.1    0.1    5.2    2.1
3       3         1  50000    130.3    0.2    4.3    2.2
4       1         2 100000    120.4    0.1    2.4    1.2
5       2         2 100000    300.2    0.2    4.5    1.4
6       3         2  50000    235.3    0.4    4.2    0.5
7       1         3 100000    233.5    0.7    8.3    6.7
8       2         3 100000    300.1    0.9    0.5    4.4
9       3         3  50000    100.2    0.4    9.2    1.2

Now you can use the ddply function from the plyr package (load it with library(plyr)) to easily extract the average load and query times for each DataSet.

> ddply(all_results, .(DataSet), colwise(mean, .(loadTime, query1, query2)))
  DataSet loadTime    query1 query2
1       1 158.1000 0.4333333    5.7
2       2 236.8000 0.4000000    3.4
3       3 155.2667 0.3333333    5.9

Incidentally, I highly recommend you look at Hadley Wickham's plyr package for its rich set of data-manipulation functions.
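The same per-DataSet summary can also be written with dplyr (plyr's modern successor); a minimal sketch, assuming dplyr >= 1.0 for across():

library(dplyr)

# mean load and query times per DataSet, analogous to the ddply call above
all_results %>%
  group_by(DataSet) %>%
  summarise(across(c(loadTime, query1, query2), mean))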

answered by Prasad Chalasani


I don't see why you need to split all_results by iteration. You can just use aggregate on all_results. There's no need for all iterations to have the same number of observations.

Lines <- "iteration, lines, loadTime, query1, query2, query3
1, 100000, 120.4, 0.5, 6.4, 1.2
1, 100000, 110.1, 0.1, 5.2, 2.1
1, 50000, 130.3, 0.2, 4.3, 2.2
2, 100000, 120.4, 0.1, 2.4, 1.2
2, 100000, 300.2, 0.2, 4.5, 1.4
2, 50000, 235.3, 0.4, 4.2, 0.5
3, 100000, 233.5, 0.7, 8.3, 6.7
3, 100000, 300.1, 0.9, 0.5, 4.4
3, 50000, 100.2, 0.4, 9.2, 1.2"

all_results <- read.csv(textConnection(Lines))

aggregate(all_results[,-1], by=all_results[,"iteration",drop=FALSE], mean)
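To see that unequal group sizes are fine, here is a quick illustrative check: drop one observation so that iteration 1 has only two rows, and the same call still returns one row of means per iteration.

# simulate an incomplete run: iteration 1 loses its first observation
partial <- all_results[-1, ]
aggregate(partial[, -1], by = partial[, "iteration", drop = FALSE], mean)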
answered by Joshua Ulrich