Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How to identify unknown number of combinations?

I have data in the following format:

Data <- data.frame(
  Names = c("Person A", "Person B","Person F", "Person G", "Person F", "Person G", "Person Q", "Person R"),
  Time_Stamp = c("2013-08-01 07:06:00", "2013-08-01 07:06:00", "2013-08-01 07:53:00", "2013-08-01 07:53:00", "2013-08-01 11:01:00", "2013-08-01 11:01:00", "2013-08-01 11:08:00", "2013-08-19 06:57:00")
)

#> Data
#      Names          Time_Stamp
# 1 Person A 2013-08-01 07:06:00
# 2 Person B 2013-08-01 07:06:00
# 3 Person F 2013-08-01 07:53:00
# 4 Person G 2013-08-01 07:53:00
# 5 Person F 2013-08-01 11:01:00
# 6 Person G 2013-08-01 11:01:00
# 7 Person Q 2013-08-01 11:08:00
# 8 Person R 2013-08-19 06:57:00

I would like to create a code that identifies when a combination (order doesn't matter) of people appear together with the same time stamp. So, for example, Person F and Person G appear together at the same time, 8:14 on 8/1/13, so they are a group and get a unique group name. If they show up again together, they still get the same name. The issue I have been having is that the real data is nearly 100,000 rows, and I do not know how many combinations of people I have in it that appear with the same time stamp, and combinations may have more than just 2 people.

I would like the new data to look like this:

Desired <- data.frame(
  Names = c("Person A", "Person B","Person F", "Person G", "Person F", "Person G", "Person Q", "Person R"),
  Time_Stamp = c("2013-08-01 07:06:00", "2013-08-01 07:06:00", "2013-08-01 07:53:00", "2013-08-01 07:53:00", "2013-08-01 11:01:00", "2013-08-01 11:01:00", "2013-08-01 11:08:00", "2013-08-19 06:57:00"),
  Group = c("Group 1", "Group 1", "Group 2", "Group 2", "Group 2", "Group 2", "No Group", "No Group")
)
#      Names          Time_Stamp    Group
# 1 Person A 2013-08-01 07:06:00  Group 1
# 2 Person B 2013-08-01 07:06:00  Group 1
# 3 Person F 2013-08-01 07:53:00  Group 2
# 4 Person G 2013-08-01 07:53:00  Group 2
# 5 Person F 2013-08-01 11:01:00  Group 2
# 6 Person G 2013-08-01 11:01:00  Group 2
# 7 Person Q 2013-08-01 11:08:00 No Group
# 8 Person R 2013-08-19 06:57:00 No Group
like image 844
ellie9813 Avatar asked Apr 16 '20 20:04

ellie9813


3 Answers

I believe the following function does what the question asks for.

The code works as follows:

  1. Factors are coded internally as consecutive integers, so coerce to factor and then to integer to get unique integers per each unique "Time_Stamp" seen as a string.
  2. Use ave to split that vector of integers by "Names", keeping just the first if f spans several names.
  3. Finally, if each level of f has only one element, return "No Group" else paste "Group" before that level.

This function uses base R only but can be used in a dplyr::mutate instruction.

group_names <- function(x, col.name, col.date){
  f <- as.integer(as.factor(x[[col.date]]))
  f <- ave(f, x[[col.name]], FUN = function(x){
    if(length(x) > 1) x[1] else x
  })
  f <- ave(f, f, FUN = function(x){
    if(length(x) == 1) "No Group" else paste("Group", x)
  })
  f
}

Data$Group <- group_names(Data, "Names", "Time_Stamp")

Or, with dplyr. Both column numbers or column names work.

Data %>% mutate(Group = group_names(., 1, 2))
Data %>% mutate(Group = group_names(., "Names", "Time_Stamp"))
#     Names          Time_Stamp    Group
#1 Person A 2013-08-01 07:06:00  Group 1
#2 Person B 2013-08-01 07:06:00  Group 1
#3 Person F 2013-08-01 07:53:00  Group 2
#4 Person G 2013-08-01 07:53:00  Group 2
#5 Person F 2013-08-01 11:01:00  Group 2
#6 Person G 2013-08-01 11:01:00  Group 2
#7 Person Q 2013-08-01 11:08:00 No Group
#8 Person R 2013-08-19 06:57:00 No Group
like image 69
Rui Barradas Avatar answered Oct 22 '22 14:10

Rui Barradas


Here is a solution using igraph

library(igraph)
u <- graph_from_data_frame(Data)
grp <- clusters(u)$membership[match(Data$Names,names(clusters(u)$membership))]
Desired <- within(Data, Group <- ave(grp,grp,FUN = function(x) {if (length(x)>1) paste("Group",x) else "No Group"}))

such that

> Desired
     Names          Time_Stamp    Group
1 Person A 2013-08-01 07:06:00  Group 1
2 Person B 2013-08-01 07:06:00  Group 1
3 Person F 2013-08-01 07:53:00  Group 2
4 Person G 2013-08-01 07:53:00  Group 2
5 Person F 2013-08-01 11:01:00  Group 2
6 Person G 2013-08-01 11:01:00  Group 2
7 Person Q 2013-08-01 11:08:00 No Group
8 Person R 2013-08-19 06:57:00 No Group
like image 35
ThomasIsCoding Avatar answered Oct 22 '22 13:10

ThomasIsCoding


An option using data.table:

library(data.table)
setDT(Data, key=c("Time_Stamp","Names"))
Data[, g := if (.N > 1L) paste(Names, collapse=""), Time_Stamp]
Data[order(g), g := fifelse(is.na(g), NA_integer_, rleid(g))]

output:

      Names          Time_Stamp    g
1: Person A 2013-08-01 07:06:00    1
2: Person B 2013-08-01 07:06:00    1
3: Person F 2013-08-01 07:53:00    2
4: Person G 2013-08-01 07:53:00    2
5: Person F 2013-08-01 11:01:00    2
6: Person G 2013-08-01 11:01:00    2
7: Person Q 2013-08-01 11:08:00 <NA>
8: Person R 2013-08-19 06:57:00 <NA>
like image 22
chinsoon12 Avatar answered Oct 22 '22 14:10

chinsoon12