
r data.table avoid class discrepancy between RHS and LHS

I have a data set containing some groups and I want to calculate the number of records in each group, where a certain condition is met. I then want to expand the result to the rest of the records within each group (i.e. where the condition is not met) because I am collapsing the table later.

I'm using data.table to do this, with the special symbol .N to count the records within each group that meet my condition. I then take the max of the values within each group to apply the result to all records in that group. My data set is quite large (nearly 5 million records).

I keep getting the following error:

  Error in `[.data.table`(dpart, , `:=`(clustersize4wk, max(clustersize4wk,  : 
  Type of RHS ('double') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

At first, I assumed that .N was producing an integer, whereas taking the max of the values by group was producing a double. However, this does not seem to be the case: in the toy example below, the class of the results column remains integer throughout, and I'm unable to reproduce the problem.

For illustration, here is an example:

# Example data:
library(data.table)

mydt <- data.table(id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                   grp = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2", "G2", "G2"),
                   name = c("Jack", "John", "Jill", "Joe", "Jim", "Julia", "Simran", "Delia", "Aurora", "Daniele", "Joan", "Mary"),
                   sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"), 
                   age = c(2,12,29,15,30,75,5,4,7,55,43,39), 
                   reportweek = c("201740", "201750", "201801", "201801", "201801", "201748", "201748", "201749", "201750", "201752", "201752", "201801"))

I am calculating the number within each group that are male like this:

mydt[sex == "m", csize := .N, by = id]

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

Some groups do not contain any males, so to avoid getting Inf in the next step I recode NA as 0:

mydt[ is.na(csize), csize := 0]

I then expand the result to all members within each group like this:

mydt[, csize := max(csize, na.rm = TRUE), by = id]

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

This is the point at which the error appears in my real data set. If I omit the step to recode NAs to 0 I can reproduce the error with the example data; otherwise not. Also with my real data set (in spite of having recoded NAs to 0) I still get the following warning:

19: In max(clustersize4wk, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf 

How can I resolve this?

My expected output is below:

> mydt
    id grp    name sex age reportweek csize
 1:  a  G1    Jack   m   2     201740     2
 2:  a  G1    John   m  12     201750     2
 3:  b  G1    Jill   f  29     201801     2
 4:  b  G1     Joe   m  15     201801     2
 5:  b  G1     Jim   m  30     201801     2
 6:  c  G2   Julia   f  75     201748     1
 7:  c  G2  Simran   m   5     201748     1
 8:  c  G2   Delia   f   4     201749     1
 9:  c  G2  Aurora   f   7     201750     1
10:  d  G2 Daniele   f  55     201752     0
11:  d  G2    Joan   f  43     201752     0
12:  d  G2    Mary   f  39     201801     0
Amy M asked Oct 16 '22 21:10


1 Answer

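The actual problem is the data type of csize. It is of type integer, because .N is an integer, whereas max() returns a double (-Inf) for any group in which every value is NA and na.rm = TRUE leaves nothing to compare.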
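
The type behaviour of max() can be checked in base R without data.table. This is a minimal sketch; the vector names x, y and res are made up for illustration:

```r
# An integer vector with at least one non-missing value: max() stays integer
x <- c(2L, NA_integer_)
typeof(max(x, na.rm = TRUE))

# Every value missing: after na.rm = TRUE nothing is left, so max() warns
# ("no non-missing arguments to max") and returns -Inf, which is a double
y <- c(NA_integer_, NA_integer_)
res <- suppressWarnings(max(y, na.rm = TRUE))
typeof(res)
is.infinite(res)
```

So a by-group max() can produce integer results for some groups and a double for others, which is the mismatch the error message complains about.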

The fix could be:

mydt[sex == "m", csize := as.double(.N), by = id]

mydt[, csize := max(csize, 0, na.rm = TRUE), by = id]
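
As a possible alternative (not part of the fix above, and only a sketch), the count can be computed for every row in one step by summing a logical condition within each group. Because sum() of a logical vector returns an integer, the column stays integer and groups without males get 0 directly. The table name dt and its reduced columns are assumptions for illustration, taken from the id and sex values in the question:

```r
library(data.table)

# Reduced copy of the question's example: only the columns the count needs
dt <- data.table(
  id  = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
  sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f")
)

# Count males per id and assign the count to every row of that id.
# na.rm = TRUE guards against missing values in sex (none in this example).
dt[, csize := sum(sex == "m", na.rm = TRUE), by = id]

typeof(dt$csize)
dt
```

This avoids the separate NA recode and the max() step, so neither the -Inf warning nor the integer/double mismatch can arise from them.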
MKR answered Oct 21 '22 02:10