Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining starts_with with group_by in dplyr

Tags:

r

dplyr

I guess it might be a simple trick, but I do not know how to achieve it...

My dateset looks like:

Name, Score
A a,  20
A,    30
B b,   40

The output I expect is:

Name, Score
A,    50
B,    40

In one word, sum the scores that have names which start with the same word (before the space if there is one). I hope the example is self-explanatory. :)

PS: The faster the code runs, the better. The dataset is huge...

like image 689
Isilmë O. Avatar asked Jan 23 '16 14:01

Isilmë O.


4 Answers

Another option would be separate

library(dplyr)
library(tidyr)
separate(df1, Name, into=c("Name", "extra")) %>% 
       group_by(Name) %>%
       summarise(Score=sum(Score))
#     Name Score
#    (chr) (int)
#1     A    50
#2     B    40

Or extract

extract(df1, Name, into= "Name", "(\\S+).*") %>%
            group_by(Name) %>%
            summarise(Score = sum(Score))  
like image 186
akrun Avatar answered Oct 20 '22 01:10

akrun


You can try something like this:

library(dplyr)
library(stringr)

df$newName <- str_extract(df$Name, '[[:alnum:]]+')
df %>% group_by(newName) %>% summarise(Score = sum(Score))

Source: local data frame [2 x 2]

  newName Score
    (chr) (int)
1       A    50
2       B    40

Note, you will want to make sure 'Name' is read as a character vector and not as factors. Use stringsAsFactors = FALSE in your read call, or use as.character to convert it.

If you want the full first 'string', you can also use this regex pattern:

df$newName <- str_extract(df$Name, '([^\\s]+)')
like image 45
Gopala Avatar answered Oct 20 '22 01:10

Gopala


I used substr to extract the first letter and then group_by. I believe that dplyr starts_with is used to make a selection of whole columns based on their titles. This solution only works if the letter you want to select is always the first letter.

require(dplyr)
df<-data.frame(Name=c("A a,","A,","B b"),Score=c(20,30,40))

df$Name <- substr(df$Name,1,1)
df %>% group_by(Name) %>% summarise(sum_score=sum(Score))

Source: local data frame [2 x 2]

   Name sum_score
  (chr)     (dbl)
1     A        50
2     B        40

You could also create the substring column as a new column and group by that if you want to keep the original data as it is.

like image 3
Holmestorm Avatar answered Oct 20 '22 01:10

Holmestorm


starts_with is used in select and rename and are operations on the columnnames not on the values in the columns. By using gsub you can extract the first letter (or word) and then summarise. With:

sumdf <- mydf %>% 
  group_by(Name = gsub("[^A-Za-z0-9].*", "", Name)) %>% 
  summarise(sumScore = sum(Score))

you get:

> sumdf
   Name sumScore
1     A       50
2     B       40
like image 3
Jaap Avatar answered Oct 20 '22 01:10

Jaap