I want to convert numeric values into a factor: if a value is below -2 the level should be "down", if it is above 2 it should be "up", and anything in between should be "no_change".
So far I thought about creating a function:
classifier <- function(x) {
  if (x >= 2) {
    return(as.factor("up"))
  } else if (x <= -2) {
    return(as.factor("down"))
  } else {
    return(as.factor("no_change"))
  }
}
I could make it iterate over the input (with a for loop) and return a list, so that I could use it with apply.
I want to apply this function to every cell of a data.frame. How can I do that?
Mock data (runif(15, min=-5, max=5)):
c(1.11004611710086, -1.86842617811635, 1.72159335808828, -2.68788822228089,
2.72551498375833, 3.67290901951492, -4.00984475389123, -2.39582793787122,
4.22395745059475, -0.360892189200968, 1.35027756914496, 2.89919016882777,
-0.158692332915962, -0.950306688901037, 3.39141107397154)
Using DF <- iris[-5] as sample data, you can use cut, as I suggested in the comments.
Try:
DF[] <- lapply(DF, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
head(DF)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 up up no_change no_change
## 2 up up no_change no_change
## 3 up up no_change no_change
## 4 up up no_change no_change
## 5 up up no_change no_change
## 6 up up no_change no_change
tail(DF)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 145 up up up up
## 146 up up up up
## 147 up up up no_change
## 148 up up up no_change
## 149 up up up up
## 150 up up up no_change
Or, with RHertel's "mock_data":
cut(mock_data, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
## [1] no_change no_change no_change down up up down
## [8] down up no_change no_change up no_change no_change
## [15] up
## Levels: down no_change up
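One detail worth checking is what happens to values that sit exactly on a break. With its default right = TRUE, cut closes each interval on the right, whereas the function in the question uses >= 2 and <= -2. The following is only a small sketch of that difference; boundary_vals, brks and labs are made-up names for illustration and are not part of the original data:

```r
# Hypothetical values sitting exactly on and around the cut points
boundary_vals <- c(-2.5, -2, 0, 2, 2.5)
brks <- c(-Inf, -2, 2, Inf)
labs <- c("down", "no_change", "up")

# Default (right = TRUE): intervals are (-Inf,-2], (-2,2], (2,Inf]
right_closed <- cut(boundary_vals, brks, labs)

# right = FALSE: intervals are [-Inf,-2), [-2,2), [2,Inf)
left_closed <- cut(boundary_vals, brks, labs, right = FALSE)

# Side-by-side view of how each setting labels the boundary values
data.frame(value = boundary_vals, right_closed, left_closed)
```

Neither setting reproduces >= 2 and <= -2 at both ends at once, so if exact ties matter the breaks need adjusting. For continuous data such as the runif sample, exact ties are unlikely to occur.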
Benchmarks
As I suggested in the comments, RHertel's approach is likely to be the most efficient. That approach uses pretty straightforward subsetting (which is fast) and factor (which is also generally fast).
On data the size you describe, you will definitely notice the difference:
set.seed(1)
nrow = 20000
ncol = 1000
x <- as.data.frame(matrix(runif(nrow * ncol, min=-5, max=5), ncol = ncol))
factorize <- function(invec) {
  factorized <- rep("no_change", length(invec))
  factorized[invec > 2] <- "up"
  factorized[invec < -2] <- "down"
  factor(factorized, c("down", "no_change", "up"))
}
RHfun <- function(indf = x) {
  indf[] <- lapply(indf, factorize)
  indf
}
AMfun <- function(DF = x) {
  DF[] <- lapply(DF, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
  DF
}
library(microbenchmark)
microbenchmark(AMfun(), RHfun(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# AMfun() 7.501814 8.015532 8.852863 8.731638 9.660191 10.198983 10
# RHfun() 1.437696 1.485791 1.723402 1.574507 1.637139 2.528574 10
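Before relying on the timings, it may be worth confirming that the two approaches label the data the same way. This is a rough sketch on a much smaller data frame; the name small and its dimensions are arbitrary choices for illustration, and factorize is repeated so the snippet runs on its own. The two methods can disagree for a value of exactly -2 (cut labels it "down", the subsetting version "no_change"), which should not arise with continuous random data like this:

```r
set.seed(1)
# Small stand-in for the benchmark data (sizes chosen arbitrarily)
small <- as.data.frame(matrix(runif(50 * 4, min = -5, max = 5), ncol = 4))

# Subsetting-based classifier, as used in the benchmark
factorize <- function(invec) {
  factorized <- rep("no_change", length(invec))
  factorized[invec > 2] <- "up"
  factorized[invec < -2] <- "down"
  factor(factorized, c("down", "no_change", "up"))
}

by_subsetting <- lapply(small, factorize)
by_cut <- lapply(small, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))

# Compare the labels column by column
same_labels <- all(mapply(
  function(a, b) all(as.character(a) == as.character(b)),
  by_subsetting, by_cut
))
same_labels
```

If same_labels comes back FALSE on your own data, the likely cause is a value lying exactly on one of the breaks.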