Find all sequences with the same column value

Tags: r

I have the following data frame:

╔══════╦═════════╗
║ Code ║ Airline ║
╠══════╬═════════╣
║    1 ║ AF      ║
║    1 ║ KL      ║
║    8 ║ AR      ║
║    8 ║ AZ      ║
║    8 ║ DL      ║
╚══════╩═════════╝

dat <- structure(list(Code = c(1L, 1L, 8L, 8L, 8L), Airline = structure(c(1L, 
5L, 2L, 3L, 4L), .Label = c("AF  ", "AR  ", "AZ  ", "DL", "KL  "
), class = "factor")), .Names = c("Code", "Airline"), class = "data.frame", row.names = c(NA, 
-5L))

My goal is to find, for each airline, the other airlines that share one of its codes, i.e. a code used by at least one other airline. The expected output would be:

+---------+------------+
| Airline | SharedWith |
+---------+------------+
| AF      | "KL"       |
| KL      | "AF"       |
| AR      | "AZ","DL"  |
+---------+------------+

The pseudocode in any imperative language would be:

for each code
  lookup all rows in the table where the value = code

Since R is not particularly list-oriented, what would be the best way to achieve the expected output?

Andrei Varanovich asked Apr 24 '16 18:04



2 Answers

Several options using the data.table package:

1) Using strsplit and paste, operating by row:

library(data.table)
setDT(dat)[, Airline := trimws(Airline)  # remove the leading and trailing whitespace from the airline codes
           ][, sharedwith := paste(Airline, collapse = ','), by = Code  # all airlines per code in one string
            ][, sharedwith := paste(unlist(strsplit(sharedwith,','))[!unlist(strsplit(sharedwith,',')) %in% Airline], 
                                    collapse = ','), by = 1:nrow(dat)]  # per row: drop the row's own airline

which gives:

> dat
   Code Airline sharedwith
1:    1      AF         KL
2:    1      KL         AF
3:    8      AR      AZ,DL
4:    8      AZ      AR,DL
5:    8      DL      AR,AZ

2) Using strsplit & paste with mapply instead of by = 1:nrow(dat):

setDT(dat)[, Airline := trimws(Airline)
           ][, sharedwith := paste(Airline, collapse = ','), Code
             ][, sharedwith := mapply(function(s,a) paste(unlist(strsplit(s,','))[!unlist(strsplit(s,',')) %in% a], 
                                                          collapse = ','),
                                      sharedwith, Airline)][]

which will give you the same result.
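
To make the filtering step in options 1 and 2 easier to follow, here is the same strsplit / %in% / paste logic in isolation. This is only a sketch: drop_self is a hypothetical helper name introduced for illustration and is not part of the answer above.

```r
# Hypothetical helper (name is an assumption, not from the answer):
# remove one airline from a comma-separated string of airlines.
drop_self <- function(s, a) {
  parts <- unlist(strsplit(s, ",", fixed = TRUE))  # "AR,AZ,DL" -> c("AR", "AZ", "DL")
  paste(parts[!parts %in% a], collapse = ",")      # keep everything except `a`
}

drop_self("AR,AZ,DL", "AZ")  # "AR,DL"
drop_self("AF,KL", "AF")     # "KL"
drop_self("AF", "AF")        # "" (an airline with no partners gets an empty string)
```

Splitting once and reusing the result avoids calling strsplit twice per row, which is the only difference from the inline version.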

3) Or by using the CJ function with paste (inspired by the expand.grid solution of @zx8754):

library(data.table)
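# Both CJ columns are named explicitly: depending on the data.table version,
# an unnamed second argument is not necessarily called V2.
setDT(dat)[, Airline := trimws(Airline)
           ][, CJ(air = Airline, other = Airline, unique = TRUE)[air != other][, .(shared = paste(other, collapse = ',')), air],
             Code]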

which gives:

   Code air shared
1:    1  AF     KL
2:    1  KL     AF
3:    8  AR  AZ,DL
4:    8  AZ  AR,DL
5:    8  DL  AR,AZ
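
The cross-join idea behind option 3 can also be sketched without data.table, using the expand.grid function mentioned above. This is a rough base R equivalent, not the original expand.grid answer; the column names air, other and SharedWith are chosen here for illustration, and the data frame is rebuilt with already-trimmed airline codes.

```r
dat <- data.frame(Code = c(1L, 1L, 8L, 8L, 8L),
                  Airline = c("AF", "KL", "AR", "AZ", "DL"),
                  stringsAsFactors = FALSE)

shared <- do.call(rbind, lapply(split(dat, dat$Code), function(d) {
  # every (air, other) combination of airlines within one code
  g <- expand.grid(air = d$Airline, other = d$Airline, stringsAsFactors = FALSE)
  g <- g[g$air != g$other, ]            # drop the pairing of an airline with itself
  if (nrow(g) == 0) return(NULL)        # code used by a single airline: nothing shared
  res <- aggregate(other ~ air, data = g, FUN = paste, collapse = ",")
  data.frame(Code = d$Code[1], Airline = res$air, SharedWith = res$other,
             stringsAsFactors = FALSE)
}))
rownames(shared) <- NULL
shared
```

Like CJ, expand.grid produces a number of rows that grows quadratically with the number of airlines per code, so this is only practical for small groups.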

A solution with dplyr & tidyr to get the desired output (inspired by @jaimedash):

library(dplyr)
library(tidyr)

dat <- dat %>% mutate(Airline = trimws(as.character(Airline)))

dat %>%
  mutate(SharedWith = Airline) %>% 
  group_by(Code) %>%
  nest(-Code, -Airline, .key = SharedWith) %>%
  left_join(dat, ., by = 'Code') %>%
  unnest() %>%
  filter(Airline != SharedWith) %>%
  group_by(Code, Airline) %>%
  summarise(SharedWith = toString(SharedWith))

which gives:

   Code Airline SharedWith
  (int)   (chr)      (chr)
1     1      AF         KL
2     1      KL         AF
3     8      AR     AZ, DL
4     8      AZ     AR, DL
5     8      DL     AR, AZ
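
The nest() call above uses the interface of an older tidyr release; tidyr 1.0.0 changed how nest() takes its arguments, so it may need adjusting on a current installation. As an alternative, here is a sketch that uses dplyr only and avoids nesting altogether, assuming the airline codes are already trimmed character strings.

```r
library(dplyr)

dat <- data.frame(Code = c(1L, 1L, 8L, 8L, 8L),
                  Airline = c("AF", "KL", "AR", "AZ", "DL"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  group_by(Code) %>%
  # within each code, list every airline of the group except the row's own
  mutate(SharedWith = vapply(
    Airline,
    function(a) toString(setdiff(Airline, a)),
    character(1),
    USE.NAMES = FALSE
  )) %>%
  ungroup()

res
```

Inside the grouped mutate(), Airline refers to the airlines of the current code only, so setdiff() removes just the row's own airline from that group.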
Jaap answered Nov 04 '22 05:11



An igraph approach:

library(igraph)

g <- graph_from_data_frame(dat)

# Find neighbours for select nodes
ne <- setNames(ego(g,2, nodes=as.character(dat$Airline), mindist=2), dat$Airline)
ne
#$`AF  `
#+ 1/7 vertex, named:
#[1] KL  

#$`KL  `
#+ 1/7 vertex, named:
#[1] AF  
---
---

# Get final format
data.frame(Airline=names(ne), 
           Shared=sapply(ne, function(x)
                                      paste(V(g)$name[x], collapse=",")))
#   Airline Shared
# 1      AF     KL
# 2      KL     AF
# 3      AR  AZ,DL
# 4      AZ  AR,DL
# 5      DL  AR,AZ
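
Why order 2 with mindist = 2 works: graph_from_data_frame treats each row as an edge between a code and an airline, so two airlines that share a code are exactly two steps apart, while airlines with no common code are not connected at all. A small sketch to check this with distances(), assuming trimmed airline names for readability:

```r
library(igraph)

dat <- data.frame(Code = c(1L, 1L, 8L, 8L, 8L),
                  Airline = c("AF", "KL", "AR", "AZ", "DL"),
                  stringsAsFactors = FALSE)

g <- graph_from_data_frame(dat)  # vertices: the 2 codes plus the 5 airlines
d <- distances(g)                # shortest path lengths, ignoring edge direction

d["AF", "KL"]  # 2: AF - code 1 - KL
d["AF", "AR"]  # Inf: no shared code, so no path

# airline-by-airline matrix: TRUE where two airlines share a code
airlines <- unique(dat$Airline)
d[airlines, airlines] == 2
```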
user20650 answered Nov 04 '22 04:11
