Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to check for unique values and returning it if there is only one unique value in an R data.table

Tags:

r

data.table

Suppose I have a large data.table that looks like dt below.

dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 10)
)
# dt
#    player_1 player_1_age player_2 player_2_age
# 1:        a           10        b           20
# 2:        b           20        a           10
# 3:        b           20        c           30
# 4:        c           30        a           10

From the dt above, I would like to create a data.table with unique players and their age like the following, player_dt:

# player_dt
# player  age
#      a   10
#      b   20
#      c   30

To do so, I've tried the code below, but it takes too long on my larger dataset, probably because I am creating a data.table for each iteration of sapply.

How would you get the player_dt above, while checking for each player that there is only one unique age value?

# get unique players
player <- sort(unique(c(dt$player_1, dt$player_2)))

# for each player, get their age, if there is only one age value
age <- sapply(player, function(x) {
  unique_values <- unique(c(
    dt[player_1 == x][["player_1_age"]],
    dt[player_2 == x][["player_2_age"]]))
  if(length(unique_values) > 1) stop() else return(unique_values)
})

# combine to create the player_dt
player_dt <- data.table(player, age)
like image 966
johnny Avatar asked May 02 '20 17:05

johnny


2 Answers

I use the data from @DavidT as input.

dt
#   player_1 player_1_age player_2 player_2_age
#1:        a           10        b           20
#2:        b           20        a           10
#3:        b           20        c           30
#4:        c           30        a           11 # <--

TL;DR

You can do

nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]

out <-
  unique(melt(
    dt,
    measure.vars = list(colsAge, colsOther),
    value.name = c("age", "player")
  )[, .(age, player)])[, if (.N == 1) # credit: https://stackoverflow.com/a/34427944/8583393
    .SD, by = player]
out
#   player age
#1:      b  20
#2:      c  30

Step-by-step

What you can to do is to melt multiple columns simultaneously - those that end with "age" and those that don't.

nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
dt1 <- melt(dt, measure.vars = list(colsAge, colsOther), value.name = c("age", "player"))

The result is

dt1
#   variable age player
#1:        1  10      a
#2:        1  20      b
#3:        1  20      b
#4:        1  30      c
#5:        2  20      b
#6:        2  10      a
#7:        2  30      c
#8:        2  11      a

Now we call unique ...

out <- unique(dt1[, .(age, player)])
out
#   age player
#1:  10      a
#2:  20      b
#3:  30      c
#4:  11      a

... and filter for groups of "player" with length equal to 1

out <- out[, if(.N == 1) .SD, by=player]
out
#   player age
#1:      b  20
#2:      c  30

Given OP's input data, that last step is not needed.

data

library(data.table)
dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)

Reference: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html

like image 148
markus Avatar answered Nov 20 '22 19:11

markus


I've altered your data so that there's at least one error to catch:

library(tidyverse)

dt <- tibble(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)
  # Get the Names columns and the Age columns
colName <- names(dt)
ageCol <- colName[str_detect(colName, "age$")]
playrCol <- colName[! str_detect(colName, "age$")]

  # Gather the Ages
ages <- dt %>% 
  select(ageCol) %>% 
  gather(player_age, age)

  # Gather the names
names <- dt %>% 
  select(playrCol ) %>% 
  gather(player_name, name)

  # Bind the two together, and throw out the duplicates
  # If there are no contradictions, this is what you want.
allNameAge <- cbind( names, ages) %>% 
  select(name, age) %>% 
  distinct() %>% 
  arrange(name)

  # But check for inconsistencies.  This should leave you with
  # an empty tibble, but instead it shows the error.
inconsistencies <- allNameAge %>% 
  group_by(name) %>% 
  mutate(AGE.COUNT = n_distinct(age)) %>% 
  filter(AGE.COUNT > 1) %>% 
  ungroup()

This should extends to more name/age column pairs.

like image 28
David T Avatar answered Nov 20 '22 18:11

David T