Combining starts_with with group_by in dplyr

Question

I guess it might be a simple trick, but I do not know how to achieve it...

My dateset looks like:

Name, Score
A a,  20
A,    30
B b,   40

The output I expect is:

Name, Score
A,    50
B,    40

In one word, sum the scores that have names which start with the same word (before the space if there is one). I hope the example is self-explanatory. :)

PS: The faster the code runs, the better. The dataset is huge...

akrun · Accepted Answer

Another option would be separate

library(dplyr)
library(tidyr)
separate(df1, Name, into=c("Name", "extra")) %>% 
       group_by(Name) %>%
       summarise(Score=sum(Score))
#     Name Score
#    (chr) (int)
#1     A    50
#2     B    40

Or extract

extract(df1, Name, into= "Name", "(\S+).*") %>%
            group_by(Name) %>%
            summarise(Score = sum(Score))

Gopala · Answer

You can try something like this:

library(dplyr)
library(stringr)

df$newName <- str_extract(df$Name, '[[:alnum:]]+')
df %>% group_by(newName) %>% summarise(Score = sum(Score))

Source: local data frame [2 x 2]

  newName Score
    (chr) (int)
1       A    50
2       B    40

Note, you will want to make sure 'Name' is read as a character vector and not as factors. Use stringsAsFactors = FALSE in your read call, or use as.character to convert it.

If you want the full first 'string', you can also use this regex pattern:

df$newName <- str_extract(df$Name, '([^\s]+)')

Holmestorm · Answer

I used substr to extract the first letter and then group_by. I believe that dplyr starts_with is used to make a selection of whole columns based on their titles. This solution only works if the letter you want to select is always the first letter.

require(dplyr)
df<-data.frame(Name=c("A a,","A,","B b"),Score=c(20,30,40))

df$Name <- substr(df$Name,1,1)
df %>% group_by(Name) %>% summarise(sum_score=sum(Score))

Source: local data frame [2 x 2]

   Name sum_score
  (chr)     (dbl)
1     A        50
2     B        40

You could also create the substring column as a new column and group by that if you want to keep the original data as it is.

Jaap · Answer

starts_with is used in select and rename and are operations on the columnnames not on the values in the columns. By using gsub you can extract the first letter (or word) and then summarise. With:

sumdf <- mydf %>% 
  group_by(Name = gsub("[^A-Za-z0-9].*", "", Name)) %>% 
  summarise(sumScore = sum(Score))

you get:

> sumdf
   Name sumScore
1     A       50
2     B       40

Combining starts_with with group_by in dplyr

Tags:

r

dplyr

Isilmë O.

4 Answers

akrun

Gopala

Holmestorm

Jaap

Recent Activity

Donate For Us

Combining starts_with with group_by in dplyr

Tags:

r

dplyr

Isilmë O.

4 Answers

akrun

Gopala

Holmestorm

Jaap

Related questions

Recent Activity

Donate For Us