Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get column mean for specific rows only?

Tags:

dataframe

r

mean

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:

  • period 1: year <= 1983
  • period 2: year >= 1984 & year <= 1990
  • period 3: year >= 1991

This is the structure of my data:

  country year     score        
 Algeria 1980     -1.1201501 
 Algeria 1981     -1.0526943 
 Algeria 1982     -1.0561565 
 Algeria 1983     -1.1274560 
 Algeria 1984     -1.1353926 
 Algeria 1985     -1.1734330 
 Algeria 1986     -1.1327666 
 Algeria 1987     -1.1263586 
 Algeria 1988     -0.8529455 
 Algeria 1989     -0.2930265 
 Algeria 1990     -0.1564207 
 Algeria 1991     -0.1526328 
 Algeria 1992     -0.9757842 
 Algeria 1993     -0.9714060 
 Algeria 1994     -1.1422258 
 Algeria 1995     -0.3675797 
 ...

The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.

This is how it should look like:

country year     score         mean   
 Algeria 1980     -1.1201501     -1.089
 Algeria 1981     -1.0526943     -1.089
 Algeria 1982     -1.0561565     -1.089
 Algeria 1983     -1.1274560     -1.089
 Algeria 1984     -1.1353926     -0.839
 Algeria 1985     -1.1734330     -0.839
 Algeria 1986     -1.1327666     -0.839
 Algeria 1987     -1.1263586     -0.839
 Algeria 1988     -0.8529455     -0.839
 Algeria 1989     -0.2930265     -0.839
 Algeria 1990     -0.1564207     -0.839
 ...

Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...

Many many thanks for your help!

like image 385
TiF Avatar asked Sep 12 '12 18:09

TiF


People also ask

How do you find the mean of only certain rows in R?

The rowMeans() function in R can be used to calculate the mean of several rows of a matrix or data frame in R.

How do you calculate mean of specific rows in pandas?

The DataFrame.mean() method is used to return the mean of the values for the requested axis. If you apply this method on a series object, then it returns a scalar value, which is the mean value of all the observations in the pandas DataFrame.

How do you find the mean of a row in a DataFrame?

The row average can be found using DataFrame. mean() function. It returns the mean of the values over the requested axis. If axis = 0 , the mean function is applied over the columns.

How do you find the mean of a selected column in R?

ColMeans() Function along with sapply() is used to get the mean of the multiple column. Dataframe is passed as an argument to ColMeans() Function. Mean of numeric columns of the dataframe is calculated.


2 Answers

datfrm$mean <-
  with (datfrm, ave( score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN= mean) )

The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:

mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
like image 164
IRTFM Avatar answered Nov 06 '22 02:11

IRTFM


Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks @DWin]. For completeness the data.table equivalent (scales for large data) is :

require(data.table)
DT = as.data.table(DF)   # or just start with a data.table in the first place

DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf))]

or findInterval is likely faster as DWin used :

DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
like image 36
Matt Dowle Avatar answered Nov 06 '22 01:11

Matt Dowle