
How to apply a function to each element of a data.frame?

Tags: r

I want to convert numeric values into a factor: if a value is below -2 the level should be "down", if it is above 2 it should be "up", and anything in between should be "no_change".

So far I thought about creating a function:

classifier <- function(x){
    if (x >= 2){
      return(as.factor("up"))
    }else if (x <= -2){
      return(as.factor("down"))
    }else {
      return(as.factor("no_change"))
    }
}

I could make it iterate (with a for loop) over the input and return a list, so I could use it with apply.

I want to apply this function to every cell of the data.frame. How can I do it?

Mock data (runif(15, min=-5, max=5)):

c(1.11004611710086, -1.86842617811635, 1.72159335808828, -2.68788822228089, 
2.72551498375833, 3.67290901951492, -4.00984475389123, -2.39582793787122, 
4.22395745059475, -0.360892189200968, 1.35027756914496, 2.89919016882777, 
-0.158692332915962, -0.950306688901037, 3.39141107397154)
llrs asked Feb 07 '23 18:02



1 Answer

Using DF <- iris[-5] as sample data, you can use cut, as I suggested in the comments.

Try:

DF[] <- lapply(DF, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))

head(DF)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           up          up    no_change   no_change
## 2           up          up    no_change   no_change
## 3           up          up    no_change   no_change
## 4           up          up    no_change   no_change
## 5           up          up    no_change   no_change
## 6           up          up    no_change   no_change

tail(DF)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 145           up          up           up          up
## 146           up          up           up          up
## 147           up          up           up   no_change
## 148           up          up           up   no_change
## 149           up          up           up          up
## 150           up          up           up   no_change
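
The empty brackets in DF[] <- matter here: lapply on its own returns a plain list, while assigning into DF[] replaces the columns and keeps the data.frame class, column names, and row names. A minimal sketch on a made-up data frame (the name `small` and its values are purely illustrative, not from the question):

```r
# Hypothetical toy data, not from the question
small <- data.frame(a = c(-3, 0, 3), b = c(2.5, -2.5, 1))

breaks <- c(-Inf, -2, 2, Inf)
labs   <- c("down", "no_change", "up")

# lapply alone returns a list of factors
as_list <- lapply(small, cut, breaks, labs)
class(as_list)
## [1] "list"

# Assigning into small[] swaps the columns but keeps the data.frame
small[] <- lapply(small, cut, breaks, labs)
class(small)
## [1] "data.frame"
```

If you would rather leave the original untouched, copy it first (for example `out <- small; out[] <- ...`) and modify the copy.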

Or, with the mock data from the question stored in a vector named "mock_data", as in RHertel's answer:

cut(mock_data, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
##  [1] no_change no_change no_change down      up        up        down     
##  [8] down      up        no_change no_change up        no_change no_change
## [15] up       
## Levels: down no_change up
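
One detail worth checking is what happens at exactly -2 and 2. By default cut uses right-closed intervals (right = TRUE), so 2 lands in "no_change" and -2 lands in "down". The classifier in the question uses >= 2 and <= -2, which no single setting of `right` reproduces at both ends. A small sketch of the two settings (the vector `edge` is made up for illustration):

```r
# Hypothetical boundary values, chosen to sit exactly on the breaks
edge   <- c(-2, 2)
breaks <- c(-Inf, -2, 2, Inf)
labs   <- c("down", "no_change", "up")

# Default (right = TRUE): intervals are (-Inf, -2], (-2, 2], (2, Inf]
cut(edge, breaks, labs)
## [1] down      no_change
## Levels: down no_change up

# right = FALSE: intervals are [-Inf, -2), [-2, 2), [2, Inf)
cut(edge, breaks, labs, right = FALSE)
## [1] no_change up
## Levels: down no_change up
```

With continuous data such as runif output, exact ties on a break are very unlikely, so this mostly matters for rounded or integer input.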

Benchmarks

As I suggested in the comments, RHertel's approach is likely to be the most efficient. That approach uses pretty straightforward subsetting (which is fast) and factor (which is also generally fast).

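On data of the size you describe, you will definitely notice the difference (a median of about 8.7 seconds for cut versus about 1.6 seconds for the subsetting approach in the run below):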

set.seed(1)
# Renamed from nrow/ncol so the base functions of the same name are not masked
n_rows <- 20000
n_cols <- 1000
x <- as.data.frame(matrix(runif(n_rows * n_cols, min = -5, max = 5), ncol = n_cols))

factorize <- function(invec) {
  factorized <- rep("no_change", length(invec))
  factorized[invec > 2]  <- "up"
  factorized[invec < -2]  <- "down"
  factor(factorized, c("down", "no_change", "up"))
}

RHfun <- function(indf = x) {
  indf[] <- lapply(indf, factorize)
  indf
}

AMfun <- function(DF = x) {
  DF[] <- lapply(DF, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))
  DF
}

library(microbenchmark)
microbenchmark(AMfun(), RHfun(), times = 10)
# Unit: seconds
#     expr      min       lq     mean   median       uq       max neval
#  AMfun() 7.501814 8.015532 8.852863 8.731638 9.660191 10.198983    10
#  RHfun() 1.437696 1.485791 1.723402 1.574507 1.637139  2.528574    10
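
Before trusting a timing comparison, it is worth confirming that both approaches give the same labels. A quick sketch on a much smaller data frame (the names `small`, `by_subset`, and `by_cut` are made up here, and the 50 x 4 size is arbitrary):

```r
set.seed(1)
# Small hypothetical data frame so the check runs instantly
small <- as.data.frame(matrix(runif(50 * 4, min = -5, max = 5), ncol = 4))

# Same subsetting-based helper as in the benchmark
factorize <- function(invec) {
  factorized <- rep("no_change", length(invec))
  factorized[invec > 2]  <- "up"
  factorized[invec < -2] <- "down"
  factor(factorized, c("down", "no_change", "up"))
}

by_subset <- small
by_subset[] <- lapply(small, factorize)

by_cut <- small
by_cut[] <- lapply(small, cut, c(-Inf, -2, 2, Inf), c("down", "no_change", "up"))

# Compare cell by cell as character matrices
same <- all(as.matrix(by_subset) == as.matrix(by_cut))
same
```

The two can only disagree when a value is exactly -2, where cut assigns "down" and factorize leaves "no_change".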
A5C1D2H2I1M1N2O1R2T1 answered Feb 16 '23 04:02
