Still learning R, but I've personally accepted that this is impossible, and I hope one of you can prove me wrong.
I'm looking to find the mean for values <= 25th percentile, and the mean for values >= 75th percentile; but not for the whole data set. I would like to find these means for subsets of data, from which the percentiles were found.
This will generate data that resembles my own:
library(data.table)
DT <- data.table(V1 <- c('AR','AR','AR','AR','AR','AR','AD','AD','AD','AD','AD','AD','BD',
'BD','BD','BD','BX','CX','DX','DX','DD','DD','DD','DD','DR','DR',
'DR','DR','DR','DR'),
V2 <- c(.12,.02,.03,.22,.44,.09,.11,.17,.15,.26,.29,.27,.16,.16,.02,.12,.02,
.03,.22,.44,.09,.11,.17,.15,.26,.29,.27,.16,.16,.02))
Looks like:
V1 V2
1: AR 0.12
2: AR 0.02
3: AR 0.03
4: AR 0.22
5: AR 0.44
6: AR 0.09
7: AD 0.11
8: AD 0.17
9: AD 0.15
10: AD 0.26
11: AD 0.29
12: AD 0.27
13: BD 0.16
14: BD 0.16
15: BD 0.02
16: BD 0.12
17: BX 0.02
18: CX 0.03
19: DX 0.22
20: DX 0.44
21: DD 0.09
22: DD 0.11
23: DD 0.17
24: DD 0.15
25: DR 0.26
26: DR 0.29
27: DR 0.27
28: DR 0.16
29: DR 0.16
30: DR 0.02
First step: Calulate median, 25th percentile, 75th percentile, and count appearances of each A_, B_, C_, D_. Got it:
dt.qtile <- DT[, list(Bottom = quantile(V2, .25),
Middle = quantile(V2, .5),
Top = quantile(V2, .75),
Appearances = .N), by = V1]
Produces:
V1 Bottom Middle Top Appearances
1: AR 0.045 0.105 0.1950 6
2: AD 0.155 0.215 0.2675 6
3: BD 0.095 0.140 0.1600 4
4: BX 0.020 0.020 0.0200 1
5: CX 0.030 0.030 0.0300 1
6: DX 0.275 0.330 0.3850 2
7: DD 0.105 0.130 0.1550 4
8: DR 0.160 0.210 0.2675 6
This is where I think it's impossible. I would like to find the values in the original V2 (DT$V2) that are less than or equal to the value of the 25th percentile, then greater than or equal to 75th percentile for each letter combination in V1.
V1 V2
1: AR 0.12 - Ignore -
2: AR 0.02 <= 0.045 \
3: AR 0.03 <= 0.045 / mean = 0.05 (Bottom)
4: AR 0.22 >= 0.1950 \
5: AR 0.44 >= 0.1950 / mean = 0.33 (Top)
6: AR 0.09 - Ignore -
------
7: AD 0.11 <= 0.155 > mean = 0.11 (Bottom)
8: AD 0.17 - Ignore -
9: AD 0.15 - Ignore -
10: AD 0.26 >= 0.2675 \
11: AD 0.29 >= 0.2675 | mean = 0.2733 (Top)
12: AD 0.27 >= 0.2675 /
...
25: DR 0.26 - Ignore -
26: DR 0.29 >= 0.2675 \
27: DR 0.27 >= 0.2675 / mean = 0.28 (Top)
28: DR 0.16 <= 0.16 \
29: DR 0.16 <= 0.16 | mean = 0.17 (Bottom)
30: DR 0.02 <= 0.16 /
Average the values in V2 that are <= 25th percentile, then average the values that are >= 75th percentile.
The new output should be something like this:
V1 Bottom Middle Top Appearances
1: AR 0.025 0.105 0.3300 6
2: AD 0.110 0.215 0.2733 6
...
8: DR 0.170 0.210 0.2800 6
This gets me close:
DT[V2 < quantile(V2, .25), mean(V2), by = V1]
But it's computing the quantile for the whole data set, not each letter combination.
So I try:
DT[V2 < DT[, quantile(V2, .25), by = V1], mean(V2), by = V1]
I get this:
Error in `[.data.table`(DT, V2 < DT[, quantile(V2, 0.25), by = V1], mean(V2), :
i is invalid type (matrix).
Perhaps in future a 2 column matrix could return a list of elements of DT
(in the spirit of A[B] in FAQ 2.14).
Please let datatable-help know if you'd like this, or add your comments to FR #657.
I know this has to be simple, but I just can't see it. What am I missing? Let me know where I can clarify.
I appreciate your help in advance!
DT[, list( Bottom = mean(V2[V2 <= quantile(V2, 0.25)]),
Middle = median(V2),
Top = mean(V2[V2 >= quantile(V2, 0.75)]),
Appearances = .N), by = V1]
I would never have found this on my own.
DT[, mean(V2[V2 < quantile(V2, 0.25)]), by = V1]
V1 V1
1: AR 0.025
2: AD 0.130
3: BD 0.020
4: BX NaN
5: CX NaN
6: DX 0.220
7: DD 0.090
8: DR 0.020
DT[, mean(V2[V2 > quantile(V2, 0.75)]), by = V1]
V1 V1
1: AR 0.33
2: AD 0.28
3: BD NaN
4: BX NaN
5: CX NaN
6: DX 0.44
7: DD 0.17
8: DR 0.28
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With