I have a data set containing some groups and I want to calculate the number of records in each group, where a certain condition is met. I then want to expand the result to the rest of the records within each group (i.e. where the condition is not met) because I am collapsing the table later.
I'm using data.table to do this, and the .N special symbol to calculate the number of records within each group that meet my condition. I then take the max of the values within each group so that the result applies to all records in that group. My data set is quite large (nearly 5 million records).
I keep getting the following error:
Error in `[.data.table`(dpart, , `:=`(clustersize4wk, max(clustersize4wk, :
Type of RHS ('double') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
At first, I assumed that using .N was producing an integer, whereas getting the max of the values by group was producing a double. However, this does not seem to be the case (in the toy example below, the class of the results column remains integer throughout), and I'm unable to reproduce the problem.
For illustration, here is an example:
library(data.table)
# Example data:
mydt <- data.table(id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                   grp = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2", "G2", "G2"),
                   name = c("Jack", "John", "Jill", "Joe", "Jim", "Julia", "Simran", "Delia", "Aurora", "Daniele", "Joan", "Mary"),
                   sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"),
                   age = c(2, 12, 29, 15, 30, 75, 5, 4, 7, 55, 43, 39),
                   reportweek = c("201740", "201750", "201801", "201801", "201801", "201748", "201748", "201749", "201750", "201752", "201752", "201801"))
I am calculating the number within each group that are male like this:
mydt[sex == "m", csize := .N, by = id]
> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE
Some groups do not contain any males, so to avoid getting Inf in the next step I recode NA as 0:
mydt[ is.na(csize), csize := 0]
I then expand the result to all members within each group like this:
mydt[, csize := max(csize, na.rm = T), by = id]
> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE
This is the point at which the error appears in my real data set. If I omit the step to recode NAs to 0 I can reproduce the error with the example data; otherwise not. Also with my real data set (in spite of having recoded NAs to 0) I still get the following warning:
19: In max(clustersize4wk, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
How can I resolve this?
My expected output is below:
> mydt
    id grp    name sex age reportweek csize
 1:  a  G1    Jack   m   2     201740     2
 2:  a  G1    John   m  12     201750     2
 3:  b  G1    Jill   f  29     201801     2
 4:  b  G1     Joe   m  15     201801     2
 5:  b  G1     Jim   m  30     201801     2
 6:  c  G2   Julia   f  75     201748     1
 7:  c  G2  Simran   m   5     201748     1
 8:  c  G2   Delia   f   4     201749     1
 9:  c  G2  Aurora   f   7     201750     1
10:  d  G2 Daniele   f  55     201752     0
11:  d  G2    Joan   f  43     201752     0
12:  d  G2    Mary   f  39     201801     0
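The actual problem is the data type of the csize column. It is of type integer, because .N is an integer. For any group with no non-missing values, max returns -Inf, which is a double, and := refuses to assign a double into an integer column.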
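A small base R check illustrates the type change. This is only a sketch; the variable names `some_present` and `all_missing` are made up for the illustration and do not come from the question.

```r
# Integer input with at least one non-missing value: max() stays integer.
some_present <- max(c(1L, NA_integer_), na.rm = TRUE)

# Every value is missing: nothing is left after na.rm = TRUE,
# so max() warns and falls back to -Inf, which is a double.
all_missing <- suppressWarnings(max(c(NA_integer_, NA_integer_), na.rm = TRUE))

typeof(some_present)
typeof(all_missing)
```

This matches the "no non-missing arguments to max; returning -Inf" warning quoted in the question, which suggests that at least one group in the real data still has only NA values when the max step runs.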
The fix could be:
mydt[sex == "m", csize := as.double(.N), by = id]
mydt[, csize := max(csize, 0, na.rm = TRUE), by = id]
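As an alternative to the two-step approach, the count could be computed for every row of the group in a single grouped assignment, which avoids both the NA recode and max. This is a sketch of a different technique, not the fix above; `dt` is a cut-down, hypothetical copy of the example table that keeps only the `id` and `sex` columns.

```r
library(data.table)

# Cut-down copy of the example data (only the columns needed here).
dt <- data.table(id  = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                 sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"))

# sum() of a logical vector returns an integer, and an all-FALSE group gives 0L,
# so every row in the group receives the count with no NA or -Inf in between.
# na.rm = TRUE guards against missing values in sex.
dt[, csize := sum(sex == "m", na.rm = TRUE), by = id]
```

Because the column stays integer throughout, the type mismatch should not arise with this form. For the real data, the condition inside sum() would need to be replaced with whatever defines clustersize4wk.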