Using summarise with weighted mean from dplyr in R

Tags:

r

dplyr

I'm trying to tidy a dataset, using dplyr. My variables contain percentages and straightforward values (in this case, page views and bounce rates). I've tried to summarize them this way:

require(dplyr)
df<-df%>%
   group_by(pagename)%>%
   summarise(pageviews=sum(pageviews), bounceRate= weighted.mean(bounceRate,pageviews))

But this returns:

 Error: 'x' and 'w' must have the same length

My dataset does not have any NAs in either the page views or the bounce rates. I'm not sure what I'm doing wrong. Maybe summarise() doesn't work with weighted.mean()?

EDIT

I've added some data:

### Source: local data frame [4 x 3]

###               pagename bounceRate pageviews
###                  (chr)      (dbl)     (dbl)
###1                url1   72.22222      1176
###2                url2   46.42857       733
###3                url2   76.92308       457
###4                url3   62.06897       601

Tobias van Elferen asked Mar 23 '17 14:03


2 Answers

summarise() evaluates its arguments in the order they appear, and each later expression sees the columns created by the earlier ones. Because pageviews is overwritten with its per-group sum first, weighted.mean() receives that new single value as the weights instead of the original column. It's safer to use different names:

df %>%
   group_by(pagename)%>%
   summarise(pageviews_sum = sum(pageviews), 
      bounceRate_mean = weighted.mean(bounceRate,pageviews))

And if you really want, you can rename afterward:

df %>%
   group_by(pagename) %>%
   summarise(pageviews_sum = sum(pageviews), 
      bounceRate_mean = weighted.mean(bounceRate,pageviews)) %>% 
   rename(pageviews = pageviews_sum, bounceRate = bounceRate_mean)
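
As a rough check of the different-names approach, the sketch below rebuilds the four sample rows from the question as a plain data.frame (the object names `result`, `url2` and `by_hand` are made up for illustration) and compares the grouped result for url2 with the weighted mean computed by hand as sum(value * weight) / sum(weight).

```r
library(dplyr)

# Sample rows copied from the question
df <- data.frame(
  pagename   = c("url1", "url2", "url2", "url3"),
  bounceRate = c(72.22222, 46.42857, 76.92308, 62.06897),
  pageviews  = c(1176, 733, 457, 601)
)

# New column names, so the original pageviews column is still
# available as the weights when weighted.mean() is evaluated
result <- df %>%
  group_by(pagename) %>%
  summarise(
    pageviews_sum   = sum(pageviews),
    bounceRate_mean = weighted.mean(bounceRate, pageviews)
  )

# Hand-computed weighted mean for url2, the only group with two rows
url2    <- df[df$pagename == "url2", ]
by_hand <- sum(url2$bounceRate * url2$pageviews) / sum(url2$pageviews)

print(result)
print(by_hand)
```

The url2 row of `result` should agree with `by_hand`, since weighted.mean() normalises the weights by their sum.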

MrFlick answered Oct 20 '22 06:10


I've found the solution. Since pageviews=sum(pageviews) is evaluated before bounceRate=weighted.mean(bounceRate,pageviews), pageviews has already been reduced to a single value per group and is therefore shorter than bounceRate, which triggers the error.
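
The length mismatch can be reproduced outside dplyr. This is a minimal base R sketch using the two url2 rows from the question; the variable names (`bounce`, `views`, `views_summed`, `msg`, `ok`) are illustrative only, and it imitates what happens inside the group rather than showing dplyr's internals.

```r
# The two url2 rows from the question
bounce <- c(46.42857, 76.92308)
views  <- c(733, 457)

# Stand-in for pageviews after it has been overwritten by sum(pageviews)
views_summed <- sum(views)  # a single value

# weighted.mean() refuses weights whose length differs from the values
msg <- tryCatch(
  weighted.mean(bounce, views_summed),
  error = function(e) conditionMessage(e)
)
print(msg)

# With the original per-row weights the call succeeds
ok <- weighted.mean(bounce, views)
print(ok)
```

Groups with a single row (url1 and url3 in the sample) would not hit the mismatch, because both vectors have length one there; it is the two-row url2 group that fails.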

The solution is simple: just switch them, so the weighted mean is computed before pageviews is overwritten:

require(dplyr)
df<-df%>%
  group_by(pagename)%>%
  summarise(bounceRate= weighted.mean(bounceRate,pageviews),pageviews=sum(pageviews))

Tobias van Elferen answered Oct 20 '22 04:10