
How to exact match two column values in entire Dataset using R

I have the dataframe below in R. I have tried various methods but haven't been able to produce the required output.

DF:

ID     Date                 city        code    uid
I-1    2020-01-01 10:12:15  New York     123    K-1
I-1    2020-01-01 10:12:15  Utha         103    K-1
I-2    2020-01-02 10:12:15  Washington   122    K-1
I-3    2020-02-01 10:12:15  Tokyo        123    K-2
I-3    2020-02-01 10:12:15  Osaka        193    K-2
I-4    2020-02-02 10:12:15  London       144    K-3
I-5    2020-02-04 10:12:15  Dubai        101    K-4
I-6    2019-11-01 10:12:15  Dubai        101    K-4
I-7    2019-11-01 10:12:15  London       144    K-3
I-8    2018-12-13 10:12:15  Tokyo        143    K-5
I-9    2019-05-17 10:12:15  Dubai        101    K-4
I-19   2020-03-11 10:12:15  Dubai        150    K-7

Dput:

structure(list(ID = c("I-1", "I-1", 
"I-2", "I-3", "I-3", "I-4", 
"I-5", "I-6", "I-7", "I-8", "I-9","I-19" 
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607", 
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952", 
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536", 
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York", 
"UTAH", "Washington", "Tokyo", 
"Osaka", "London", "Dubai", 
"Dubai", "London", "Tokyo", "Dubai", 
"Dubai"), CODE = c("221010", 
"411017", "638007", "583101", "560029", "643102", "363001", "452001", 
"560024", "509208"), UID = c("K-1", 
"K-1", "K-1", "K-2", "K-2", 
"K-3", "K-4", "K-4", "K-3", 
"K-5","K-4","K-7")), .Names = c("ID", "DATE", 
"CITY", "CODE", "UID"), row.names = c(NA, 
10L), class = "data.frame")

Using the above dataframe, I want to fetch the records between 1st Jan 2020 and 29th Feb 2020, then compare those IDs against the entire database to check whether the combination of city and code matches any other ID. The matches should then be categorized by how many have the same uid and how many have a different one.

Where,

  • Match - the combination of city and code matches another ID in the database
  • Same_uid - of the matching IDs, how many have the same uid
  • different_uid - of the matching IDs, how many have a different uid
  • uid_count - count of IDs with the same uid as that particular ID in the entire database

Note - I have more than 10M records in the dataframe.

Required Output

ID      Date                  city         code   uid   Match   Same_uid   different_uid  uid_count
I-1     2020-01-01 10:12:15   New York     123    K-1    No      0          0              2
I-2     2020-01-02 10:12:15   Washington   122    K-1    No      0          0              2
I-3     2020-02-01 10:12:15   Tokyo        123    K-2    No      0          0              1   
I-4     2020-02-02 10:12:15   London       144    K-3    Yes     1          0              2
I-5     2020-02-04 10:12:15   Dubai        101    K-4    Yes     2          0              3 
Sophia Wilson asked Aug 05 '20 03:08


1 Answer

One approach:

Load in the dataset

    library(tidyverse)
    library(lubridate)

   

    # Rebuild the question's table as a tibble
    mydata <- tibble(
      ID = c("I-1", "I-1",
             "I-2", "I-3",
             "I-3", "I-4",
             "I-5", "I-6",
             "I-7", "I-8",
             "I-9", "I-19"),
      Date = c("2020-01-01", "2020-01-01",
               "2020-01-02", "2020-02-01",
               "2020-02-01", "2020-02-02",
               "2020-02-04", "2019-11-01",
               "2019-11-01", "2018-12-13",
               "2019-05-17", "2020-03-11"),
      city = c("New York", "Utha",
               "Washington", "Tokyo",
               "Osaka", "London",
               "Dubai", "Dubai",
               "London", "Tokyo",
               "Dubai", "Dubai"),
      code = c("123", "103", "122", "123", "193", "144",
               "101", "101", "144", "143", "101", "150"),
      uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
              "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))

    # Drop any time part, parse Date as a date, and keep code as character
    mydata <- mydata %>%
      mutate(Date = ymd(str_remove(Date, " .*")),
             code = as.character(code))

Where clause number 1

I use count() from dplyr to count how often each city/code combination occurs, then case_when() to label each combination "Yes" or "No" as requested.

# This counts each city/code combination and fulfills your "Match" column requirement
startdate <- "2017-01-01"
enddate   <- "2020-03-29"
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%
  mutate(Match = case_when(
    count_samecode > 1 ~ "Yes",
    TRUE ~ "No")) %>%
  head()
# # A tibble: 6 x 4
#  city     code  count_samecode Match
# <chr>    <chr>          <int> <chr>
# 1 Dubai    101                3 Yes  
# 2 Dubai    150                1 No   
# 3 London   144                2 Yes  
# 4 New York 123                1 No   
# 5 Osaka    193                1 No   
# 6 Tokyo    123                1 No  

Where clause number 2

I do the same with uid, counting each city/uid combination inside the date range.

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()

# # A tibble: 6 x 3
# city     uid   `UIDs_#_filtered`
# <chr>    <chr>             <int>
# 1 Dubai    K-4                   3
# 2 Dubai    K-7                   1
# 3 London   K-3                   2
# 4 New York K-1                   1
# 5 Osaka    K-2                   1
# 6 Tokyo    K-2                   1

Where clause number 3

Counting the result of clause number 2 once more by city gives the number of distinct UIDs per city; a value greater than 1 means the city appears with more than one UID.

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>% 
  count(city, name = "UIDs_#_different") %>% 
  head()
# # A tibble: 6 x 2
# city     `UIDs_#_different`
# <chr>                 <int>
# 1 Dubai                     2
# 2 London                    1
# 3 New York                  1
# 4 Osaka                     1
# 5 Tokyo                     2
# 6 Utha                      1
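The double count above can also be written with n_distinct(). The block below is a minimal sketch of that equivalent form, not part of the original answer; the data frame `cities` and the column name `uids_distinct` are assumptions made up for illustration, using a few city/uid pairs taken from the question's table.

```r
library(dplyr)

# Illustrative subset of the question's data (city and uid only)
cities <- data.frame(
  city = c("Dubai", "Dubai", "Dubai", "London", "Tokyo", "Tokyo"),
  uid  = c("K-4", "K-4", "K-7", "K-3", "K-2", "K-5"),
  stringsAsFactors = FALSE
)

# Number of distinct uids seen for each city
uids_per_city <- cities %>%
  group_by(city) %>%
  summarise(uids_distinct = n_distinct(uid), .groups = "drop")

print(uids_per_city)
```

Both forms count distinct uids per city; n_distinct() just does it in one summarise step instead of two chained count() calls.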

Where clause number 4

Taking the same code from clause number 2, I can drop the date filter to count across the entire dataset.

mydata %>% 
  count(city, uid, name = "UIDs_#_all") %>% 
  head()

Putting it all together

Using several left_joins, we can get closer to your desired output. EDIT: The function now keeps the first instance of the ID for each city/code combination.

check_duplicates_filterview.f <- function(df, startdate, enddate) {
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string   "yyyy-mm-dd"

  # Cities that appear inside the date window
  cityfilter <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    distinct(city) %>%
    pull(1)

  # Keep only those cities and normalise the column types
  df <- df %>%
    filter(city %in% cityfilter) %>%
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))

  # Count of each city/code combination in the entire DB
  entire.db.countcodes <- df %>%
    count(city, code)

  # First row of each city/code combination inside the window, flagged as a Match
  where.1 <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    distinct(city, code, .keep_all = TRUE) %>%
    left_join(entire.db.countcodes) %>%
    rename("count_samecode" = n) %>%
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      TRUE ~ "No"))

  # Rows per city/uid inside the window
  where.2 <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    count(city, uid, name = "UIDs_#_filtered")

  # Distinct uids per city inside the window
  where.3 <- df %>%
    filter(Date >= startdate,
           Date <= enddate) %>%
    distinct(city, uid) %>%
    count(city, name = "UIDs_#_distinct")

  # Rows per city/uid in the entire DB (restricted to the cities above)
  where.4 <- df %>%
    filter(city %in% cityfilter) %>%
    count(city, uid, name = "UIDs_#_all")

  # Join on the shared columns (dplyr prints which ones it picked)
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}


# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city     code  count_samecode Match uid   `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr>    <chr>          <int> <chr> <chr>             <int>        <int>             <int>
# 1 Dubai    101                2 Yes   K-4                   2            3                 1
# 2 London   144                1 No    K-3                   1            2                 1
# 3 New York 123                1 No    K-1                   1            1                 1
# 4 Tokyo    143                1 No    K-5                   1            1                 1
# 5 Utha     103                1 No    K-1                   1            1                 1
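
The function above reports counts per city rather than per ID. As a rough sketch of one way to get closer to the columns in the Required Output, the block below counts, for every ID, how many other IDs share one of its city/code combinations, split by whether they carry the same uid. It is not the method used above. The function name `match_summary`, the helper column names, and the small data frame (copied from the question's table) are assumptions for illustration. So are these rules: keep the first row per ID, treat uid_count as the number of distinct IDs per uid, and assume each ID has a single uid and Date is already a Date column.

```r
library(dplyr)

# Sample data copied from the question's table (times dropped)
df <- data.frame(
  ID   = c("I-1", "I-1", "I-2", "I-3", "I-3", "I-4",
           "I-5", "I-6", "I-7", "I-8", "I-9", "I-19"),
  Date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-02-01",
                   "2020-02-01", "2020-02-02", "2020-02-04", "2019-11-01",
                   "2019-11-01", "2018-12-13", "2019-05-17", "2020-03-11")),
  city = c("New York", "Utha", "Washington", "Tokyo", "Osaka", "London",
           "Dubai", "Dubai", "London", "Tokyo", "Dubai", "Dubai"),
  code = c("123", "103", "122", "123", "193", "144",
           "101", "101", "144", "143", "101", "150"),
  uid  = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
           "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one summary row per ID inside the date window
match_summary <- function(df, startdate, enddate) {
  # One row per ID / city / code / uid so repeated rows are not double counted
  combos <- df %>%
    distinct(ID, city, code, uid) %>%
    add_count(city, code, name = "n_ids") %>%                # IDs sharing city + code
    add_count(city, code, uid, name = "n_ids_same_uid") %>%  # ... and also the uid
    mutate(same      = n_ids_same_uid - 1L,                  # exclude the ID itself
           different = n_ids - n_ids_same_uid)

  # Collapse to one row per ID across all of its city/code combinations
  per_id <- combos %>%
    group_by(ID) %>%
    summarise(Same_uid      = sum(same),
              different_uid = sum(different),
              .groups = "drop") %>%
    mutate(Match = if_else(Same_uid + different_uid > 0, "Yes", "No"))

  # Distinct IDs per uid across the entire data
  uid_counts <- df %>%
    distinct(ID, uid) %>%
    count(uid, name = "uid_count")

  df %>%
    filter(Date >= as.Date(startdate), Date <= as.Date(enddate)) %>%
    distinct(ID, .keep_all = TRUE) %>%                       # first row per ID
    left_join(per_id, by = "ID") %>%
    left_join(uid_counts, by = "uid") %>%
    select(ID, Date, city, code, uid, Match, Same_uid, different_uid, uid_count)
}

result <- match_summary(df, "2020-01-01", "2020-02-29")
print(result)
```

On the question's sample, this gives the same Match, Same_uid, different_uid and uid_count values as the Required Output. Every step is a grouped count or a join, so no rows are compared pairwise; whether it is fast enough on 10M records is untested here.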
Bryan Wammack answered Nov 19 '22 19:11