Grouping, comparing, and counting rows in R

I have a data frame that looks like the following:

     system    Id initial final
665       9 16001    6070  6071
683      10 16001    6100  6101
696      11 16001    6101  6113
712      10 16971    6150  6151
715      11 16971    6151  6163
4966      7  4118   10238 10242
5031      9  4118   10260 10278
5088     10  4118   10279 10304
5115     11  4118   10305 10317


structure(list(system = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L, 
11L), Id = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L, 
4118L, 4118L), initial = c(6070, 6100, 6101, 6150, 6151, 10238, 
10260, 10279, 10305), final = c(6071, 6101, 6113, 6151, 6163, 
10242, 10278, 10304, 10317)), .Names = c("system", "Id", "initial", 
"final"), row.names = c(665L, 683L, 696L, 712L, 715L, 4966L, 
5031L, 5088L, 5115L), class = "data.frame")

I would like to get a new data frame with the following structure:

     Id  system length initial final
1 16001 9,10,11      3    6070  6113
2 16971   10,11      2    6150  6163
3  4118       7      1   10238 10242
4  4118 9,10,11      3   10260 10317


structure(list(Id = c(16001L, 16971L, 4118L, 4118L), system = structure(c(3L,
1L, 2L, 3L), .Label = c("10,11", "7", "9,10,11"), class = "factor"),
    length = c(3L, 2L, 1L, 3L), initial = c(6070L, 6150L, 10238L,
    10260L), final = c(6113, 6163, 10242, 10317)), .Names = c("Id",
"system", "length", "initial", "final"), class = "data.frame", row.names = c(NA,
-4L))

The rows should be grouped by Id and by runs of consecutive rows in which "system" increases by exactly one. For each group I would also like to get the "system" values involved and how many of them there are, plus the first "initial" and the last "final" of the group.

Is it possible to do that in R? Thanks.

user3060550 asked Sep 29 '22 19:09


1 Answer

You could use data.table. Convert the "data.frame" to a "data.table" (setDT), then create a grouping variable "indx": take the difference of adjacent elements of "system" (diff(system)), flag the positions where that difference is not 1, and cumsum the resulting logical vector. Finally, use "Id" and "indx" as grouping variables to get the statistics.

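library(data.table)
# indx starts a new run whenever "system" does not increase by exactly 1
setDT(df)[, list(system = toString(system), length = .N,
                 initial = initial[1L], final = final[.N]),
          by = list(Id, indx = cumsum(c(TRUE, diff(system) != 1)))][
          , indx := NULL][]   # drop the helper column and print the result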

#      Id    system length initial final
#1: 16001 9, 10, 11      3    6070  6113
#2: 16971    10, 11      2    6150  6163
#3:  4118         7      1   10238 10242
#4:  4118 9, 10, 11      3   10260 10317
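
To see how the grouping index is built, here is a minimal base R sketch that uses only the "system" column from the question. The names system_vals, breaks and run_id are illustrative and do not appear in the answer above.

```r
# "system" values copied from the question's data frame
system_vals <- c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L, 11L)

# TRUE wherever a row does NOT continue the previous row's run (step != 1);
# the leading TRUE makes the first row open the first run
breaks <- c(TRUE, diff(system_vals) != 1)

# cumulative sum of the break flags gives one integer label per run
run_id <- cumsum(breaks)

print(rbind(system = system_vals, run_id = run_id))
```

Rows that share a run_id (and the same Id) end up in the same output row, which is why the Id 4118 block splits into "7" and "9, 10, 11".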

Or, based on @jazzurro's comment about using the first/last functions from dplyr:

library(dplyr)
df %>%
    group_by(indx = cumsum(c(TRUE, diff(system) != 1)), Id) %>%
    summarise(system = toString(system), length = n(),
              initial = first(initial), final = last(final))
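
If neither package is available, the same idea can be sketched in base R. This is not part of the original answer, only an assumption-labelled alternative: the helper names run and res are made up here, and the run index also breaks on a change of Id so that split() alone defines the groups.

```r
# Data frame rebuilt from the question's dput() output
df <- data.frame(
  system  = c(9L, 10L, 11L, 10L, 11L, 7L, 9L, 10L, 11L),
  Id      = c(16001L, 16001L, 16001L, 16971L, 16971L, 4118L, 4118L, 4118L, 4118L),
  initial = c(6070, 6100, 6101, 6150, 6151, 10238, 10260, 10279, 10305),
  final   = c(6071, 6101, 6113, 6151, 6163, 10242, 10278, 10304, 10317)
)

# A new run starts when "system" does not step by 1 or when "Id" changes
run <- cumsum(c(TRUE, diff(df$system) != 1 | diff(df$Id) != 0))

# Summarise each run: first Id/initial, last final, collapsed system values
res <- do.call(rbind, lapply(split(df, run), function(g) {
  data.frame(Id = g$Id[1], system = toString(g$system), length = nrow(g),
             initial = g$initial[1], final = g$final[nrow(g)])
}))
rownames(res) <- NULL
print(res)
```

Like the data.table version, this relies on the rows already being in the order in which the runs should be detected.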
akrun answered Oct 07 '22 18:10