Basically, I have the following 2 data.table
s:
dt
- Contains a value field (y
) and a grouping field (x
)bk
- Contains 4 "break" fields (bn
) describing bucket structure for the interval [1, inf) for each group x
found in dt
. Each bn
represents the minimum value (inclusive) for the bucket and extends until the next bucket (ex: the 4 buckets for x
=1 are [1,3), [3,5), [5, 10), [10, inf)). Note that the bucket structures are not necessarily unique.
> #4 groups (x), each with a bucket structure defined breaks (bn).
> bk<- data.table(x=c(1:4), b1=c(1,1,1,1), b2=c(3,3,4,4), b3=c(5,5,7,8), b4=c(10,10,10,10), key="x")
> bk
x b1 b2 b3 b4
1: 1 1 3 5 10
2: 2 1 3 5 10
3: 3 1 4 7 10
4: 4 1 4 8 10
> dt<- data.table(x=rep(c(1:4),5), y=rep(c(1:10),2), key="x")
> dt
x y
1: 1 1
2: 1 5
3: 1 9
4: 1 3
5: 1 7
6: 2 2
7: 2 6
8: 2 10
9: 2 4
10: 2 8
11: 3 3
12: 3 7
13: 3 1
14: 3 5
15: 3 9
16: 4 4
17: 4 8
18: 4 2
19: 4 6
20: 4 10
My goal is to add a field b
to dt
that indicates which bucket (1, 2, 3, or 4) the record falls in based on the bucket structure corresponding to the group x
. See the desired output below:
x y b
1: 1 1 1 #Buckets for x=1
2: 1 5 3
3: 1 9 3
4: 1 3 2
5: 1 7 3
6: 2 2 1 #Buckets for x=2 (same as 1)
7: 2 6 3
8: 2 10 4
9: 2 4 2
10: 2 8 3
11: 3 3 1 #Buckets for x=3
12: 3 7 3
13: 3 1 1
14: 3 5 2
15: 3 9 3
16: 4 4 2 #Buckets for x=4
17: 4 8 3
18: 4 2 1
19: 4 6 2
20: 4 10 4
My initial idea was to join the two data.table
s and use the cut
function to return the bucket number for each record, however I am having trouble with the break
argument. First attempt shown below:
> bkt[dt, .(x, y, b=cut(y, breaks=c(b1, b2, b3, b4, "inf"), include.lowest=TRUE, labels=c(1:4)))]
Error in cut.default(y, breaks = c(b1, b2, b3, b4, "inf"), include.lowest = TRUE, :
'breaks' are not unique
If I create a variable a
to hold the bucket structure (say, for x
=1 for example), the following works as I had expected:
> a<- c(1, 3, 5, 10, "inf")
> bkt[dt, .(x, y, b=cut(y, breaks=a, include.lowest=TRUE, labels=c(1:4)))]
x y b
1: 1 1 1
2: 1 5 2
3: 1 9 3
4: 1 3 1
5: 1 7 3
6: 2 2 1
7: 2 6 3
8: 2 10 3
9: 2 4 2
10: 2 8 3
11: 3 3 1
12: 3 7 3
13: 3 1 1
14: 3 5 2
15: 3 9 3
16: 4 4 2
17: 4 8 3
18: 4 2 1
19: 4 6 3
20: 4 10 3
This is still not a practical solution for my applications, but I'm hoping that someone can help me understand how to deliver the bucket structure information to the breaks
argument properly to get a similar result. I have tried various combinations of the c
, list
, unlist
, as.numeric
functions to pass the correct break
argument but have had no luck. Any help/insight would be greatly appreciated. Thanks!
Full disclosure, I am new to R and this is my first post to please be gentle.
Change the join syntax a little:
dt[bk, v :=
cut(y, breaks = c(b1, b2, b3, b4, Inf), include.lowest = TRUE, labels = 1:4)
, by=.EACHI]
x y v
1: 1 1 1
2: 1 5 2
3: 1 9 3
4: 1 3 1
5: 1 7 3
6: 2 2 1
7: 2 6 3
8: 2 10 3
9: 2 4 2
10: 2 8 3
11: 3 3 1
12: 3 7 2
13: 3 1 1
14: 3 5 2
15: 3 9 3
16: 4 4 1
17: 4 8 2
18: 4 2 1
19: 4 6 2
20: 4 10 3
These results differ from the desired output in the OP, but I think the mistake lies in the interpretation of the cut
arguments (which I find confusing).
This approach is pretty inelegant, since each b1, ..., b4 has to be written; and it won't scale well as more break points are added. I agree with @NathanWerth's suggestion to reshape the bk
table. His approach also gives the OP's desired result without fiddling with cut
arguments.
Side note: The correct symbol is Inf
not "inf"
.
You can use melt.data.table
to restructure the bk
dataset into an easier form:
bk_long <- melt.data.table(
bk,
id.vars = 'x',
measure.vars = paste0('b', 1:4),
value.name = 'y'
)
setkey(bk_long, x)
bk_long[, variable := NULL]
bk_long[, b := seq_len(.N), by = x]
bk_long
# x y b
# 1: 1 1 1
# 2: 1 3 2
# 3: 1 5 3
# 4: 1 10 4
# 5: 2 1 1
# 6: 2 3 2
# 7: 2 5 3
# 8: 2 10 4
# 9: 3 1 1
# 10: 3 4 2
# 11: 3 7 3
# 12: 3 10 4
# 13: 4 1 1
# 14: 4 4 2
# 15: 4 8 3
# 16: 4 10 4
Then do a rolling join, as Frank suggested:
bk_long[dt, on = c('x', 'y'), roll = TRUE]
# x y b
# 1: 1 1 1
# 2: 1 5 3
# 3: 1 9 3
# 4: 1 3 2
# 5: 1 7 3
# 6: 2 2 1
# 7: 2 6 3
# 8: 2 10 4
# 9: 2 4 2
# 10: 2 8 3
# 11: 3 3 1
# 12: 3 7 3
# 13: 3 1 1
# 14: 3 5 2
# 15: 3 9 3
# 16: 4 4 2
# 17: 4 8 3
# 18: 4 2 1
# 19: 4 6 2
# 20: 4 10 4
After a number of tries, I finally got findInterval
to work.
The method is similar to frank's except that by
is used per group rather than in a single join. Using the built-in value, .BY
, you can iterate the rows of bk that are fed to the second argument (vec) of findInterval
.
dt[, b := findInterval(y, c(unlist(bk[.BY, b1:b4]), Inf), rightmost.closed=FALSE), by=x]
which returns
dt
x y b
1: 1 1 1
2: 1 5 3
3: 1 9 3
4: 1 3 2
5: 1 7 3
6: 2 2 1
7: 2 6 3
8: 2 10 4
9: 2 4 2
10: 2 8 3
11: 3 3 1
12: 3 7 3
13: 3 1 1
14: 3 5 2
15: 3 9 3
16: 4 4 2
17: 4 8 3
18: 4 2 1
19: 4 6 2
20: 4 10 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With