I have data in the following format:
Data <- data.frame(
Names = c("Person A", "Person B","Person F", "Person G", "Person F", "Person G", "Person Q", "Person R"),
Time_Stamp = c("2013-08-01 07:06:00", "2013-08-01 07:06:00", "2013-08-01 07:53:00", "2013-08-01 07:53:00", "2013-08-01 11:01:00", "2013-08-01 11:01:00", "2013-08-01 11:08:00", "2013-08-19 06:57:00")
)
#> Data
# Names Time_Stamp
# 1 Person A 2013-08-01 07:06:00
# 2 Person B 2013-08-01 07:06:00
# 3 Person F 2013-08-01 07:53:00
# 4 Person G 2013-08-01 07:53:00
# 5 Person F 2013-08-01 11:01:00
# 6 Person G 2013-08-01 11:01:00
# 7 Person Q 2013-08-01 11:08:00
# 8 Person R 2013-08-19 06:57:00
I would like to create a code that identifies when a combination (order doesn't matter) of people appear together with the same time stamp. So, for example, Person F and Person G appear together at the same time, 8:14 on 8/1/13, so they are a group and get a unique group name. If they show up again together, they still get the same name. The issue I have been having is that the real data is nearly 100,000 rows, and I do not know how many combinations of people I have in it that appear with the same time stamp, and combinations may have more than just 2 people.
I would like the new data to look like this:
Desired <- data.frame(
Names = c("Person A", "Person B","Person F", "Person G", "Person F", "Person G", "Person Q", "Person R"),
Time_Stamp = c("2013-08-01 07:06:00", "2013-08-01 07:06:00", "2013-08-01 07:53:00", "2013-08-01 07:53:00", "2013-08-01 11:01:00", "2013-08-01 11:01:00", "2013-08-01 11:08:00", "2013-08-19 06:57:00"),
Group = c("Group 1", "Group 1", "Group 2", "Group 2", "Group 2", "Group 2", "No Group", "No Group")
)
# Names Time_Stamp Group
# 1 Person A 2013-08-01 07:06:00 Group 1
# 2 Person B 2013-08-01 07:06:00 Group 1
# 3 Person F 2013-08-01 07:53:00 Group 2
# 4 Person G 2013-08-01 07:53:00 Group 2
# 5 Person F 2013-08-01 11:01:00 Group 2
# 6 Person G 2013-08-01 11:01:00 Group 2
# 7 Person Q 2013-08-01 11:08:00 No Group
# 8 Person R 2013-08-19 06:57:00 No Group
I believe the following function does what the question asks for.
The code works as follows:
"Time_Stamp"
seen as a string.ave
to split that vector of integers by "Names"
, keeping just the first if f
spans several names.f
has only one element, return "No Group"
else paste "Group"
before that level.This function uses base R only but can be used in a dplyr::mutate
instruction.
group_names <- function(x, col.name, col.date){
f <- as.integer(as.factor(x[[col.date]]))
f <- ave(f, x[[col.name]], FUN = function(x){
if(length(x) > 1) x[1] else x
})
f <- ave(f, f, FUN = function(x){
if(length(x) == 1) "No Group" else paste("Group", x)
})
f
}
Data$Group <- group_names(Data, "Names", "Time_Stamp")
Or, with dplyr
. Both column numbers or column names work.
Data %>% mutate(Group = group_names(., 1, 2))
Data %>% mutate(Group = group_names(., "Names", "Time_Stamp"))
# Names Time_Stamp Group
#1 Person A 2013-08-01 07:06:00 Group 1
#2 Person B 2013-08-01 07:06:00 Group 1
#3 Person F 2013-08-01 07:53:00 Group 2
#4 Person G 2013-08-01 07:53:00 Group 2
#5 Person F 2013-08-01 11:01:00 Group 2
#6 Person G 2013-08-01 11:01:00 Group 2
#7 Person Q 2013-08-01 11:08:00 No Group
#8 Person R 2013-08-19 06:57:00 No Group
Here is a solution using igraph
library(igraph)
u <- graph_from_data_frame(Data)
grp <- clusters(u)$membership[match(Data$Names,names(clusters(u)$membership))]
Desired <- within(Data, Group <- ave(grp,grp,FUN = function(x) {if (length(x)>1) paste("Group",x) else "No Group"}))
such that
> Desired
Names Time_Stamp Group
1 Person A 2013-08-01 07:06:00 Group 1
2 Person B 2013-08-01 07:06:00 Group 1
3 Person F 2013-08-01 07:53:00 Group 2
4 Person G 2013-08-01 07:53:00 Group 2
5 Person F 2013-08-01 11:01:00 Group 2
6 Person G 2013-08-01 11:01:00 Group 2
7 Person Q 2013-08-01 11:08:00 No Group
8 Person R 2013-08-19 06:57:00 No Group
An option using data.table
:
library(data.table)
setDT(Data, key=c("Time_Stamp","Names"))
Data[, g := if (.N > 1L) paste(Names, collapse=""), Time_Stamp]
Data[order(g), g := fifelse(is.na(g), NA_integer_, rleid(g))]
output:
Names Time_Stamp g
1: Person A 2013-08-01 07:06:00 1
2: Person B 2013-08-01 07:06:00 1
3: Person F 2013-08-01 07:53:00 2
4: Person G 2013-08-01 07:53:00 2
5: Person F 2013-08-01 11:01:00 2
6: Person G 2013-08-01 11:01:00 2
7: Person Q 2013-08-01 11:08:00 <NA>
8: Person R 2013-08-19 06:57:00 <NA>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With