OK, this has me absolutely perplexed and worried.
As part of a routine, I have been classifying individual observations of variables as TRUE or FALSE based on whether their values are above or below/equal to the median value. However, I have been getting behavior in R that is largely unexpected from performing this simple test.
So take this set of observations:
data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)
For me to classify these values, I did:
data_med=median(data)
quant_data=data
quant_data[quant_data>data_med]="High"
quant_data[quant_data<=data_med]="Low"
I know there are a gazillion ways of doing this more efficiently, but what has me worried is that the output from this does not make sense. Since there are no NaNs in the set and the test is all-inclusive (> or <=), I should end up with a vector of only "High"/"Low" values, but instead I get:
[1] "High" "High" "High" "High" "High" "High" "High" "High" "Low" "High" "Low" "High" "Low" "Low" "Low" "Low" "1e-04"
[18] "Low" "High" "High" "High" "Low" "Low" "Low" "High" "Low" "Low" "Low" "1e-04" "Low" "High" "Low" "Low" "High"
[35] "High" "Low" "High" "High" "High" "High" "High" "High" "Low" "Low" "Low" "High" "High" "Low" "Low" "1e-04" "Low"
[52] "1e-04" "Low" "Low" "High" "Low" "Low" "Low" "Low" "Low" "High" "High" "High" "High" "High" "Low" "Low" "Low"
[69] "1e-04" "High" "High" "High" "High"
See the "1e-04"s? What is even stranger, let's pick value 69, one of the ones that return odd values:
data[69]
>1e-04
If I test this value alone, I get what I expected to get:
data[69]<=data_med
TRUE
Can someone explain this behavior? It just seems downright dangerous...
Let's walk through what you did here.
data=c(0.6666667, 0.8333, 0.6666667, 0.8333, 0.8333, 0.75, 0.9999, 0.7499667, 0.25, 0.6666667, 0.1667, 0.7499667, 0.5, 0.2500333, 0.3333667, 0.0834, 0.0001, 0.2500333, 0.8333, 0.9999, 0.9999, 0.2500333, 0.2500333, 0.3333667, 0.9166, 0.5, 0.2500333, 0.4166667, 0.0001, 0.1667333, 0.6666333, 0.0834, 0.1667, 0.6666333, 0.9166, 0.1667, 0.7499333, 0.9166, 0.9166, 0.9166, 0.7499667, 0.7499667, 0.4166667, 0.5, 0.2500333, 0.9166, 0.6666667, 0.1667333, 0.25, 0.0001, 0.3333667, 0.0001, 0.25, 0.0834, 0.9999, 0.0834, 0.1667, 0.5, 0.2500333, 0.3333667, 0.9166, 0.9166, 0.8333, 0.9166, 0.75, 0.0834, 0.4166667, 0.5, 0.0001, 0.9999, 0.8333, 0.6666667, 0.9166)
data_med=median(data) ## 0.5
quant_data=data ## irrelevant
quant_data[quant_data>data_med]="High"
But by doing this you have converted quant_data to a character vector:
str(quant_data)
## chr [1:73] "High" "High" "High" "High" "High" "High" "High" ...
Now the comparison between a character value and the data_med value is almost meaningless, because data_med will get coerced to a character value too, so the two are compared as strings rather than as numbers:
"High" <= "0.5" ## FALSE
"1e-04" <= "0.5" ## FALSE -- this is your problem.
quant_data[quant_data<=data_med]="Low"
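The coercion is easier to see on a tiny vector. This is only a sketch: the values and the names x, med and still_low are made up for illustration and are not part of the original data.

```r
x <- c(0.75, 0.25, 0.0001)   # illustrative values, not the original data
med <- 0.5                   # stand-in for median(data)

x[x > med] <- "High"         # assigning a string coerces the whole vector
class(x)                     # "character"
x                            # "High" "0.25" "1e-04"

# med is now coerced to "0.5" and the comparison is done on strings:
# "0.25" sorts before "0.5", but "High" and "1e-04" do not
still_low <- x <= med        # FALSE TRUE FALSE
```

Only the second element would be relabelled "Low"; the string "1e-04" fails the test and is left as it is.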
What you presumably meant to do (and a reason to assign quant_data=data) was:
quant_data[data>data_med]="High"
quant_data[data<=data_med]="Low"
table(quant_data)
## High Low
## 35 38
As @Arun points out in comments above, quant_data <- ifelse(data>data_med,"High","Low") would work too. So would an appropriate use of cut().
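A minimal sketch of both alternatives, using a small made-up vector (vals, med, by_ifelse and by_cut are hypothetical names, not from the question). With cut(), the default right = TRUE makes each interval closed on the right, so a value equal to the median falls in "Low", matching the <= test.

```r
vals <- c(0.75, 0.25, 0.0001, 0.5, 0.9166)   # illustrative values only
med <- median(vals)                          # 0.5 for these values

# ifelse() always tests the untouched numeric vector
by_ifelse <- ifelse(vals > med, "High", "Low")

# cut() bins into (-Inf, med] and (med, Inf] and returns a factor
by_cut <- cut(vals, breaks = c(-Inf, med, Inf), labels = c("Low", "High"))

by_ifelse              # "High" "Low" "Low" "Low" "High"
as.character(by_cut)   # same labels as by_ifelse
```

Note that cut() returns a factor rather than a character vector, which is often what you want for a grouping variable anyway.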