Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group similar rows in R

Tags:

r

grouping

I have a table as follows:

   Rptname     Score

    Bebo23        8
    Bebo22        9
    Bebo19        10
    Alt88         12
    Alt67         11
    Jimm          5
    Jimm2         7

etc. I would like to sum into groups those rows that are similar. ie

     Bebo         27
     Alt          22
     Jimm         12

The beginning of the row name is always the similar part to group on but the number of characters that are similar can vary. I appreciate that I will have to define the groups and probably using some kind of regular expression but I'm not sure how to group and sum on this basis. Thanks for your help in advance

like image 443
Sebastian Zeki Avatar asked Jan 24 '15 07:01

Sebastian Zeki


Video Answer


2 Answers

You can strip off the numbers at the end using sub and do aggregate

do.call(`data.frame`, aggregate(Score~cbind(Rptname=sub('\\d+$', '', 
                        Rptname)), df, sum))
#    Rptname Score
#1     Alt    23
#2    Bebo    27
#3    Jimm    12

Or use transform with aggregate (as suggested by @docendo discimus)

aggregate(Score ~ Rptname, transform(df, Rptname = sub("\\d+$", 
                        "", Rptname)), sum)

Or an option with data.table

library(data.table)
 setDT(df)[, .(Score=sum(Score)),
           by=list(Rptname=sub('\\d+$','', Rptname))]

Or using rowsum (suggested by @alexis_laz

with(df, rowsum(Score, sub('\\d+$', '', Rptname)))
#     [,1]
#Alt    23
#Bebo   27
#Jimm   12

Update

If the grouping is based on first three characters, you can use substr

aggregate(Score~Rptname, transform(df, Rptname=substr(Rptname, 1,3)), sum)
#   Rptname Score
#1     Alt    23
#2     Beb    27
#3     Jim    12
like image 95
akrun Avatar answered Oct 09 '22 08:10

akrun


With dplyr:

library(dplyr)
DF %>% group_by(Rptname = sub("\\d+$", "", Rptname)) %>% summarise(Score = sum(Score))
#Source: local data frame [3 x 2]
#
#  Rptname Score
#1     Alt    23
#2    Bebo    27
#3    Jimm    12

Update:

If you want to group by the first three letters in "Rptname", you can use the following code in dplyr:

DF %>% group_by(Rptname = substr(Rptname, 1, 3)) %>% summarise(Score = sum(Score))
#Source: local data frame [3 x 2]
#
#  Rptname Score
#1     Alt    23
#2     Beb    27
#3     Jim    12
like image 21
talat Avatar answered Oct 09 '22 06:10

talat