Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Referencing data.table fields in cut function in j clause

Tags:

r

data.table

Basically, I have the following 2 data.tables:

dt - Contains a value field (y) and a grouping field (x)
bk - Contains 4 "break" fields (bn) describing bucket structure for the interval [1, inf) for each group x found in dt. Each bn represents the minimum value (inclusive) for the bucket and extends until the next bucket (ex: the 4 buckets for x=1 are [1,3), [3,5), [5, 10), [10, inf)). Note that the bucket structures are not necessarily unique.

> #4 groups (x), each with a bucket structure defined breaks (bn).
> bk<- data.table(x=c(1:4), b1=c(1,1,1,1), b2=c(3,3,4,4), b3=c(5,5,7,8), b4=c(10,10,10,10), key="x")
> bk
   x b1 b2 b3 b4
1: 1  1  3  5 10
2: 2  1  3  5 10
3: 3  1  4  7 10
4: 4  1  4  8 10
> dt<- data.table(x=rep(c(1:4),5), y=rep(c(1:10),2), key="x")
> dt
    x  y
 1: 1  1
 2: 1  5
 3: 1  9
 4: 1  3
 5: 1  7
 6: 2  2
 7: 2  6
 8: 2 10
 9: 2  4
10: 2  8
11: 3  3
12: 3  7
13: 3  1
14: 3  5
15: 3  9
16: 4  4
17: 4  8
18: 4  2
19: 4  6
20: 4 10

My goal is to add a field b to dt that indicates which bucket (1, 2, 3, or 4) the record falls in based on the bucket structure corresponding to the group x. See the desired output below:

    x  y  b
 1: 1  1  1  #Buckets for x=1
 2: 1  5  3
 3: 1  9  3
 4: 1  3  2
 5: 1  7  3
 6: 2  2  1  #Buckets for x=2 (same as 1)
 7: 2  6  3
 8: 2 10  4
 9: 2  4  2
10: 2  8  3
11: 3  3  1  #Buckets for x=3
12: 3  7  3
13: 3  1  1
14: 3  5  2
15: 3  9  3
16: 4  4  2  #Buckets for x=4
17: 4  8  3
18: 4  2  1
19: 4  6  2
20: 4 10  4

My initial idea was to join the two data.tables and use the cut function to return the bucket number for each record, however I am having trouble with the break argument. First attempt shown below:

> bkt[dt, .(x, y, b=cut(y, breaks=c(b1, b2, b3, b4, "inf"), include.lowest=TRUE, labels=c(1:4)))]
Error in cut.default(y, breaks = c(b1, b2, b3, b4, "inf"), include.lowest = TRUE,  : 
  'breaks' are not unique

If I create a variable a to hold the bucket structure (say, for x=1 for example), the following works as I had expected:

> a<- c(1, 3, 5, 10, "inf")
> bkt[dt, .(x, y, b=cut(y, breaks=a, include.lowest=TRUE, labels=c(1:4)))]
    x  y b
 1: 1  1 1
 2: 1  5 2
 3: 1  9 3
 4: 1  3 1
 5: 1  7 3
 6: 2  2 1
 7: 2  6 3
 8: 2 10 3
 9: 2  4 2
10: 2  8 3
11: 3  3 1
12: 3  7 3
13: 3  1 1
14: 3  5 2
15: 3  9 3
16: 4  4 2
17: 4  8 3
18: 4  2 1
19: 4  6 3
20: 4 10 3

This is still not a practical solution for my applications, but I'm hoping that someone can help me understand how to deliver the bucket structure information to the breaks argument properly to get a similar result. I have tried various combinations of the c, list, unlist, as.numeric functions to pass the correct break argument but have had no luck. Any help/insight would be greatly appreciated. Thanks!

Full disclosure, I am new to R and this is my first post to please be gentle.

like image 311
Jason Avatar asked Jul 27 '17 17:07

Jason


3 Answers

Change the join syntax a little:

dt[bk, v := 
  cut(y, breaks = c(b1, b2, b3, b4, Inf), include.lowest = TRUE, labels = 1:4)
, by=.EACHI]

    x  y v
 1: 1  1 1
 2: 1  5 2
 3: 1  9 3
 4: 1  3 1
 5: 1  7 3
 6: 2  2 1
 7: 2  6 3
 8: 2 10 3
 9: 2  4 2
10: 2  8 3
11: 3  3 1
12: 3  7 2
13: 3  1 1
14: 3  5 2
15: 3  9 3
16: 4  4 1
17: 4  8 2
18: 4  2 1
19: 4  6 2
20: 4 10 3

These results differ from the desired output in the OP, but I think the mistake lies in the interpretation of the cut arguments (which I find confusing).

This approach is pretty inelegant, since each b1, ..., b4 has to be written; and it won't scale well as more break points are added. I agree with @NathanWerth's suggestion to reshape the bk table. His approach also gives the OP's desired result without fiddling with cut arguments.


Side note: The correct symbol is Inf not "inf".

like image 194
Frank Avatar answered Oct 06 '22 00:10

Frank


You can use melt.data.table to restructure the bk dataset into an easier form:

bk_long <- melt.data.table(
  bk,
  id.vars = 'x',
  measure.vars = paste0('b', 1:4),
  value.name = 'y'
)
setkey(bk_long, x)
bk_long[, variable := NULL]
bk_long[, b := seq_len(.N), by = x]
bk_long
#     x  y b
#  1: 1  1 1
#  2: 1  3 2
#  3: 1  5 3
#  4: 1 10 4
#  5: 2  1 1
#  6: 2  3 2
#  7: 2  5 3
#  8: 2 10 4
#  9: 3  1 1
# 10: 3  4 2
# 11: 3  7 3
# 12: 3 10 4
# 13: 4  1 1
# 14: 4  4 2
# 15: 4  8 3
# 16: 4 10 4

Then do a rolling join, as Frank suggested:

bk_long[dt, on = c('x', 'y'), roll = TRUE]
#     x  y b
#  1: 1  1 1
#  2: 1  5 3
#  3: 1  9 3
#  4: 1  3 2
#  5: 1  7 3
#  6: 2  2 1
#  7: 2  6 3
#  8: 2 10 4
#  9: 2  4 2
# 10: 2  8 3
# 11: 3  3 1
# 12: 3  7 3
# 13: 3  1 1
# 14: 3  5 2
# 15: 3  9 3
# 16: 4  4 2
# 17: 4  8 3
# 18: 4  2 1
# 19: 4  6 2
# 20: 4 10 4
like image 25
Nathan Werth Avatar answered Oct 05 '22 23:10

Nathan Werth


After a number of tries, I finally got findInterval to work.

The method is similar to frank's except that by is used per group rather than in a single join. Using the built-in value, .BY, you can iterate the rows of bk that are fed to the second argument (vec) of findInterval.

dt[, b := findInterval(y, c(unlist(bk[.BY, b1:b4]), Inf), rightmost.closed=FALSE), by=x]

which returns

dt
    x  y b
 1: 1  1 1
 2: 1  5 3
 3: 1  9 3
 4: 1  3 2
 5: 1  7 3
 6: 2  2 1
 7: 2  6 3
 8: 2 10 4
 9: 2  4 2
10: 2  8 3
11: 3  3 1
12: 3  7 3
13: 3  1 1
14: 3  5 2
15: 3  9 3
16: 4  4 2
17: 4  8 3
18: 4  2 1
19: 4  6 2
20: 4 10 4
like image 27
lmo Avatar answered Oct 05 '22 23:10

lmo