I have the following data:
dat <- data.frame(user_id = c(101,102,102,103,103,106),
phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203, 4030204))
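I want to count the unique users, where rows belong to the same user if they share a user_id or a phone_number, either directly or through a chain of other rows. As you can see, here we have 2 unique users: 101, 102 and 103 are linked through shared phone numbers, while 106 stands alone. So, ultimately, the table I want to create is the following: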
user_id phone_number new_user_id
101 4030201 1
102 4030201 1
102 4030202 1
103 4030202 1
103 4030203 1
106 4030204 2
Any ideas on how I could calculate this in R? A solution in a different language that I can then translate to R would also help.
Update 2 (some minor tweaks were needed)
I had to ask two questions of my own to be able to solve this. If you deal with this kind of problem a lot, it is worth learning the igraph package, which is primarily used for network analysis. There may be a simpler way of doing it, but for now I think this will do. Let me walk you through it:
library(dplyr)
library(purrr)
# In the first chunk we iterate over every row of your data set to find out
# whether there is a connection between the corresponding rows and the others
map(1:nrow(dat), function(x) {
  dat %>%
    mutate(id = row_number()) %>%
    pmap_lgl(., ~ {x <- unlist(dat[x,]);
                   any(x %in% c(...))})
}) %>%
  exec(cbind, !!!.) %>%
  as.data.frame() -> dat2
dat2 %>%
  pmap(~ sub("V", "", names(dat2))[c(...)] %>% as.numeric()) -> ids
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
[[3]]
[1] 2 3 4
[[4]]
[1] 3 4 5
[[5]]
[1] 4 5 8
[[6]]
[1] 6
[[7]]
[1] 7
[[8]]
[1] 5 8
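As a cross-check of the lists printed above, here is a minimal base R sketch of the same idea: for each row, collect the rows that share its user_id or its phone_number. It is only an illustration, not the code used in this answer. It assumes the 8-row data from the Data section, and it compares each column only with itself (the purrr version above pools both columns).

```r
# Assumption: the 8-row data set from the Data section of this answer
dat <- data.frame(
  user_id = c(101, 102, 102, 103, 103, 106, 107, 111),
  phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203,
                   4030204, 4030205, 4030203)
)

# For each row i, the indices of all rows with the same user_id
# or the same phone_number (row i itself is always included)
ids <- lapply(seq_len(nrow(dat)), function(i) {
  which(dat$user_id == dat$user_id[i] |
        dat$phone_number == dat$phone_number[i])
})

ids[[5]]  # rows linked directly to row 5
```

Row 5 is linked to row 4 through user 103 and to row 8 through phone number 4030203, which is why both appear in its list.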
Then we group all the related ids together. For this part I used solutions proposed by my dear friends @det & @Ian Campbell, because I don't know how to use igraph.
library(igraph)
map(ids, function(a) map_int(ids, ~length(base::intersect(a, .x)) > 0) * 1L) %>%
  reduce(rbind) %>%
  graph.adjacency() %>%
  as.undirected() %>%
  components() %>%
  pluck("membership") %>%
  split(seq_along(.), .) %>%
  map(~unique(unlist(ids[.x]))) -> grouped_ids
$`1`
[1] 1 2 3 4 5 8
$`2`
[1] 6
$`3`
[1] 7
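If igraph feels like a black box, the grouping step can also be sketched in base R. The idea is to give every row its own label and then repeatedly pull all rows of each set down to the smallest label in that set, until nothing changes. This is a rough illustration of what finding connected components amounts to here, not how igraph does it internally. The names label and grp are made up for the example, and ids is typed in by hand from the output of the first step.

```r
# Assumption: ids as printed in the first step, typed in by hand
ids <- list(c(1, 2), c(1, 2, 3), c(2, 3, 4), c(3, 4, 5),
            c(4, 5, 8), 6, 7, c(5, 8))

# Start with every row in its own group, labelled by its row number
label <- seq_along(ids)

# Give all rows in each set the smallest label found in that set,
# and repeat until a full pass changes nothing
repeat {
  old <- label
  for (s in ids) label[s] <- min(label[s])
  if (identical(label, old)) break
}

# Renumber the surviving labels as 1, 2, 3, ... in order of first appearance
grp <- match(label, unique(label))
grouped_ids <- split(seq_along(grp), grp)
```

On this input the loop settles after two passes, and grouped_ids holds the same three groups as the igraph pipeline above.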
After grouping all the related ones together, we can then group our data set:
dat %>%
  mutate(id = row_number()) %>%
  rowwise() %>%
  mutate(grp = seq(length(grouped_ids))[map_lgl(grouped_ids, ~ id %in% .x)])
user_id phone_number id grp
1 101 4030201 1 1
2 102 4030201 2 1
3 102 4030202 3 1
4 103 4030202 4 1
5 103 4030203 5 1
6 106 4030204 6 2
7 107 4030205 7 3
8 111 4030203 8 1
Data
structure(list(user_id = c(101, 102, 102, 103, 103, 106, 107,
111), phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203,
4030204, 4030205, 4030203)), class = "data.frame", row.names = c(NA,
-8L))
Simplifying my friend's answer a bit
dat <- data.frame(user_id = c(101,102,102,103,103,106),
phone_number = c(4030201, 4030201, 4030202, 4030202, 4030203, 4030204))
library(tidyverse)
library(igraph)
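# Each row is an edge between a user_id and a phone_number
graph.data.frame(dat) %>%
  components() %>%
  # extract the named membership vector (one entry per vertex)
  pluck("membership") %>%
  stack() %>%
  set_names(c('GRP', 'user_id')) %>%
  right_join(dat %>% mutate(user_id = as.factor(user_id)), by = c('user_id'))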
GRP user_id phone_number
1 1 101 4030201
2 1 102 4030201
3 1 102 4030202
4 1 103 4030202
5 1 103 4030203
6 2 106 4030204
On the dat given in the comments, it gives:
GRP user_id phone_number
1 1 101 4030201
2 1 102 4030201
3 1 102 4030202
4 1 103 4030202
5 1 103 4030203
6 2 106 4030204
7 3 107 4030205
8 1 111 4030203
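One caveat with building the graph straight from dat: vertices are identified by value, so a user_id that happened to equal a phone_number would be treated as the same vertex. Below is a hedged sketch of one way around that. It prefixes the two columns before building the graph and writes the result into a new_user_id column, as in the question. The prefixes "u_" and "p_" and the names edges, memb and grp are just choices made for this example.

```r
library(igraph)

dat <- data.frame(user_id = c(101, 102, 102, 103, 103, 106),
                  phone_number = c(4030201, 4030201, 4030202,
                                   4030202, 4030203, 4030204))

# Prefix each column so a user ID can never collide with a phone number
edges <- data.frame(from = paste0("u_", dat$user_id),
                    to   = paste0("p_", dat$phone_number))

# Undirected graph: one edge per row, linking a user to a phone number
g <- graph_from_data_frame(edges, directed = FALSE)

# Component membership, named by vertex
memb <- components(g)$membership

# Look up each row's user vertex, then renumber the groups 1, 2, ...
# in order of first appearance
grp <- unname(memb[edges$from])
dat$new_user_id <- match(grp, unique(grp))

dat
```

The final match() call is only there so the group numbers do not depend on how igraph happens to number its components.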