Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Find the count of each unique value of a variable that occurs within timeframe of each observation

Tags:

I have a panel data set with observations for individuals over time. There are three columns: person, date, and value. I want to create three new variables. First: the number of unique entries in the value column for each person-day that occur within 1 year before the observation. Second: a set of dummies that reflect whether or not the person had each possible entry in the value column within the last year. Third: a set of count variables that count the number of times the person had each possible entry in the value column within the previous year.

I have a way to do this on a toy data set, but when I try to apply my method to a large data set, I run out of memory.

Is there a more efficient way to get these three variables?

My toy data set approach has 5 steps:

  1. create two new sets of variables that are the lagged dates and values for each person. I lag back as many times as the person with the most entries.
  2. I loop across the new variables and compare each one to the current date. If the current date is less than 1 year after the lags, I fill in the lagged date variable with the corresponding lagged value variable. I fill in instead of create a new column to save memory. If the lagged variable is out of range, I replace the lagged date with a string "OOR" for out of range.
  3. I paste together the lagged date variables. There will be some value entries and a bunch of "OOR". I remove the "OOR" with a regular expression. I separate all entries within the string with a unique character like "-"
  4. I loop over all possible entries in the value column and create a new variable for if the pasted string from step 3 matches the possible entry, and count the occurrences of the possible entry in the pasted string.
  5. I count the unique entries in the pasted string from step 3, keying off the separating string "-".
# In this example, "Species" = "person", "Sepal.Width" = "value"
# I use 3 days instead of 1 year

library(data.table)
library(zoo)
library(stringr)

set.seed(481516)
dt <- as.data.table(iris)
dt[, date := as.Date(seq_len(.N), format = "%Y"), by = Species]
dt[, sepal_width_above_3 := ifelse(Sepal.Width > 3, 1, 0)]
dt[, random_drop := runif(nrow(dt))]
dt <- dt[random_drop >= 0.1]

dt_lag <- dt[, .(Species, date, Sepal.Width)]

#step 1: lag the date - a new var for every species-date combo
dt_lag[, species_count := .N, by = Species]
keep_names <- names(dt_lag)[names(dt_lag) != "Species"]
dt_lag <- cbind(dt_lag[, ..keep_names], 
                dt_lag[, shift(date, n=1:max(dt_lag$species_count), give.names = T), by = Species],
                dt_lag[, shift(Sepal.Width, n=1:max(dt_lag$species_count), give.names = T), by = Species])


#step 2: which values do we keep?
for (X in names(dt_lag)[names(dt_lag) %like% "^date_lag_"]){
  Y <- gsub("date","Sepal.Width", X)
  dt_lag[, (X) := ifelse(date - get(X) > 0 & date - get(X) <= 3, get(Y) , "OOR")]
  dt_lag[, (Y) := NULL]
}

#step 3: paste together
dt_lag[, sepal_width_values_within_lag_3 := gsub('-NA|-OOR','',do.call(paste, c(.SD, sep="-"))), .SDcols = names(dt_lag) %like% "date_lag_"]
for (X in names(dt_lag)[names(dt_lag) %like% "^date_lag_"]){
  dt_lag[, (X) := NULL]
}

#step 4: counts and dummies for each type
for (X in sort(unique(dt_lag$Sepal.Width))) {
  # X = 1
  spec_count = paste("sepal_width_count_lag_365_",X,sep="")
  spec_dummy = paste("sepal_width_dummy_lag_365_",X,sep="")
  pattern = as.character(X)

  dt_lag[, (spec_count) := str_count(sepal_width_values_within_lag_3, eval(pattern))]
  dt_lag[, (spec_dummy) := str_detect(sepal_width_values_within_lag_3, eval(pattern))]
}

#step 5: unique counts
dt_lag[, unique_sepal_width_values_within_lag_3_count := sapply(sepal_width_values_within_lag_3, function(x)
  length(    # count items
    unique(   # that are unique
      scan(   # when arguments are presented to scan as text
        text=x, what="", sep ="-",  # when separated by ","
        quiet=TRUE)))  )]

Here are the results for 1 entry in the value column (width = 2)

head(dt_lag[,.(date, Species, sepal_width_values_within_lag_3, sepal_width_count_lag_365_2, sepal_width_dummy_lag_365_2, unique_sepal_width_values_within_lag_3_count)])

    date      Species sepal_width_values_within_lag_3 sepal_width_count_lag_365_2 sepal_width_dummy_lag_365_2 unique_sepal_width_values_within_lag_3_count
1: 1970-09-14  setosa                             3.5                           0                       FALSE                                            1
2: 1970-09-15  setosa                           3-3.5                           0                       FALSE                                            2
3: 1970-09-16  setosa                       3.2-3-3.5                           1                        TRUE                                            3
4: 1970-09-17  setosa                       3.1-3.2-3                           1                        TRUE                                            3
5: 1970-09-18  setosa                     3.6-3.1-3.2                           1                        TRUE                                            3
6: 1970-09-19  setosa                     3.9-3.6-3.1                           0                       FALSE                                            3
like image 782
gannawag Avatar asked Sep 13 '19 15:09

gannawag


People also ask

How do you count unique occurrences in R?

Method 1: Using length(unique()) function Unique() function when provided with a list will give out only the unique ones from it. Later length() function can calculate the frequency.

How do you count unique values in a data set?

You can use the combination of the SUM and COUNTIF functions to count unique values in Excel. The syntax for this combined formula is = SUM(IF(1/COUNTIF(data, data)=1,1,0)). Here the COUNTIF formula counts the number of times each value in the range appears.

How do you find the unique values in a column of an R data frame?

To find unique values in a column in a data frame, use the unique() function in R. In Exploratory Data Analysis, the unique() function is crucial since it detects and eliminates duplicate values in the data.

How do you count how many times a value is repeated in R?

Use the length() function to count the number of elements returned by the which() function, as which function returns the elements that are repeated more than once. The length() function in R Language is used to get or set the length of a vector (list) or other objects.


1 Answers

This is only a partial answer, since I did not fully understand your 2nd and 3rd problem...

#create data.table with the correct names, based on your sample data (i think)
DT <- dt[, .(person = Species, date, value = Sepal.Width)]
#set keys
setkey(DT, person, date)
#create unique values of `value in the last year before the observation, for each `person
DT[ DT, 
    #get the unique values for the last year, suppress immediate output with {}
    unique_values_prev_year := {
      val = DT[ person == i.person & date %between% c( i.date - lubridate::years(1), i.date) ]$value
      unique_val = sort( unique( val ) )
      list( paste0( unique_val, collapse = "-" ) )
      }, 
    #do the above for each row
    by = .EACHI ]

output

#         person       date value                           unique_values_prev_year
# 1:      setosa 1970-09-14   3.5                                               3.5
# 2:      setosa 1970-09-15   3.0                                             3-3.5
# 3:      setosa 1970-09-16   3.2                                         3-3.2-3.5
# 4:      setosa 1970-09-17   3.1                                     3-3.1-3.2-3.5
# 5:      setosa 1970-09-19   3.9                                 3-3.1-3.2-3.5-3.9
# ---                                                                             
# 133: virginica 1970-10-28   3.3 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 134: virginica 1970-10-29   3.0 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 135: virginica 1970-10-30   2.5 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 136: virginica 1970-10-31   3.0 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 137: virginica 1970-11-01   3.4 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
like image 74
Wimpel Avatar answered Dec 07 '22 00:12

Wimpel