How to get column mean for specific rows only?

Tags: dataframe, r, mean

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:

period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991

This is the structure of my data:

  country year     score        
 Algeria 1980     -1.1201501 
 Algeria 1981     -1.0526943 
 Algeria 1982     -1.0561565 
 Algeria 1983     -1.1274560 
 Algeria 1984     -1.1353926 
 Algeria 1985     -1.1734330 
 Algeria 1986     -1.1327666 
 Algeria 1987     -1.1263586 
 Algeria 1988     -0.8529455 
 Algeria 1989     -0.2930265 
 Algeria 1990     -0.1564207 
 Algeria 1991     -0.1526328 
 Algeria 1992     -0.9757842 
 Algeria 1993     -0.9714060 
 Algeria 1994     -1.1422258 
 Algeria 1995     -0.3675797 
 ...

The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.

This is how it should look:

country year     score         mean   
 Algeria 1980     -1.1201501     -1.089
 Algeria 1981     -1.0526943     -1.089
 Algeria 1982     -1.0561565     -1.089
 Algeria 1983     -1.1274560     -1.089
 Algeria 1984     -1.1353926     -0.839
 Algeria 1985     -1.1734330     -0.839
 Algeria 1986     -1.1327666     -0.839
 Algeria 1987     -1.1263586     -0.839
 Algeria 1988     -0.8529455     -0.839
 Algeria 1989     -0.2930265     -0.839
 Algeria 1990     -0.1564207     -0.839
 ...

Every approach I tried quickly became very complicated, and I have to calculate the mean scores for different periods of time for over 90 countries ...

Many many thanks for your help!

asked Sep 12 '12 18:09

TiF

2 Answers

datfrm$mean <-
  with(datfrm, ave(score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN = mean))
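As a rough sketch of how that one-liner behaves, the rows below are copied from the question (Algeria only). The helper column `period` is an illustrative name that is not in the original, and adding `country` as a second grouping variable is an assumption about what the question's "over 90 countries" requires: it keeps each country's period means separate.

```r
# Sample rows copied from the question (Algeria only)
datfrm <- data.frame(
  country = "Algeria",
  year    = 1980:1995,
  score   = c(-1.1201501, -1.0526943, -1.0561565, -1.1274560,
              -1.1353926, -1.1734330, -1.1327666, -1.1263586,
              -0.8529455, -0.2930265, -0.1564207, -0.1526328,
              -0.9757842, -0.9714060, -1.1422258, -0.3675797)
)

# findInterval() labels each year 1, 2 or 3 using left-closed intervals:
# [-Inf, 1984), [1984, 1991), [1991, Inf)
datfrm$period <- findInterval(datfrm$year, c(-Inf, 1984, 1991, Inf))

# ave() returns a vector as long as score, holding each row's group mean.
# Grouping by country as well as period keeps countries apart.
datfrm$mean <- ave(datfrm$score, datfrm$country, datfrm$period, FUN = mean)

print(round(unique(datfrm$mean), 3))
```

For the Algeria rows, the first two period means round to -1.089 and -0.839, matching the desired output in the question.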

The title question differs somewhat from the real question, and on its own would be answered with logical indexing. If you wanted the mean for only one particular subset, say year >= 1984 & year <= 1990, you would compute it with:

mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
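A minimal sketch of the logical-indexing route, using the 1983 to 1991 rows from the question. The name `in_p2` is illustrative and not from the original; writing the result back into a `mean` column for the matching rows only is one possible way to use it, not the only one.

```r
# Rows for 1983-1991 copied from the question
datfrm <- data.frame(
  year  = 1983:1991,
  score = c(-1.1274560, -1.1353926, -1.1734330, -1.1327666, -1.1263586,
            -0.8529455, -0.2930265, -0.1564207, -0.1526328)
)

# Logical vector: TRUE for rows that fall in period 2
in_p2 <- datfrm$year >= 1984 & datfrm$year <= 1990

# Mean of score over those rows only
mn84_90 <- mean(datfrm$score[in_p2])

# Store it on the matching rows; all other rows stay NA
datfrm$mean <- NA_real_
datfrm$mean[in_p2] <- mn84_90

print(round(mn84_90, 3))
```

If `score` may contain missing values, `mean(..., na.rm = TRUE)` would be needed to avoid an `NA` result.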

answered Nov 06 '22 02:11

IRTFM

~~Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted~~ [proved wrong, thanks @DWin]. For completeness, the data.table equivalent (which scales to large data) is:

require(data.table)
DT = as.data.table(DF)   # or just start with a data.table in the first place

# cut() closes intervals on the right by default, so break after 1983 and 1990
DT[, mean:=mean(score), by=cut(year,c(-Inf,1983,1990,Inf))]

Or use findInterval, as DWin did, which is likely faster:

DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
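One detail worth checking when swapping between the two functions: given the same break vector `c(-Inf, 1984, 1991, Inf)`, `cut()` and `findInterval()` place the boundary years differently, because `cut()` defaults to `right = TRUE` (right-closed intervals) while `findInterval()` uses left-closed intervals. The base R sketch below compares them on the years around the question's period boundaries; the variable names are illustrative.

```r
years  <- c(1983, 1984, 1990, 1991)
breaks <- c(-Inf, 1984, 1991, Inf)

# findInterval(): left-closed, [1984, 1991), so 1984 starts period 2
by_find <- findInterval(years, breaks)

# cut() default (right = TRUE): right-closed, (1984, 1991],
# so 1984 lands in period 1 and 1991 in period 2
by_cut_default <- as.integer(cut(years, breaks))

# cut() with right = FALSE matches findInterval()
by_cut_left <- as.integer(cut(years, breaks, right = FALSE))

print(rbind(by_find, by_cut_default, by_cut_left))
```

With `cut()`, either pass `right = FALSE` or shift the breaks to 1983 and 1990 to get the periods defined in the question.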

answered Nov 06 '22 01:11

Matt Dowle
