
Min-max scaling/normalization in R for train and test data

Tags:

r

I am looking to create a function that takes the training set and the testing set as its arguments, min-max scales (normalizes) the training set, and then uses the same minimum and range values from the training set to scale the test set, returning both.

So far this is the function I have come up with:

min_max_scaling <- function(train, test){

  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))

  # scale the training data

  train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))

  for(i in seq_len(ncol(train))){
    column <- (train[,i] - min_vals[i])/range1[i]
    train_scaled[i] <- column
  }

  colnames(train_scaled) <- colnames(train)

  # scale the testing data using the min and range of the train data

  test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))

  for(i in seq_len(ncol(test))){
    column <- (test[,i] - min_vals[i])/range1[i]
    test_scaled[i] <- column
  }

  colnames(test_scaled) <- colnames(test)

  return(list(train = train_scaled, test = test_scaled))
}

The definition of min max scaling is similar to this question asked earlier on SO - Normalisation of a two column data using min and max values

My questions are:
1. Is there a way to vectorize the two for loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?


asked May 18 '17 14:05

Jash Shah

1 Answer

Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.

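normalize <- function(x, na.rm = TRUE) {
    # pass na.rm through so missing values do not turn the whole result into NA
    return((x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}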

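For a single column, call normalize(df$name) directly. To normalize every column of a data frame, apply the function column-wise; apply() with MARGIN = 2 returns a matrix, so convert the result back to a data frame:

as.data.frame(apply(df, 2, normalize))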
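As a small worked example of the column-wise use described above, here is a minimal sketch in base R. The data frame `df` and its column names are made up for illustration, and `normalize()` is redefined here only so the snippet runs on its own.

```r
# Min-max normalization of one numeric vector: (x - min) / (max - min)
normalize <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical example data; the column names are assumptions
df <- data.frame(height = c(150, 160, 170), weight = c(50, 80, 65))

# lapply() walks over the columns and returns a list, which converts
# back to a data frame that keeps the original column names
df_scaled <- as.data.frame(lapply(df, normalize))

# apply() with MARGIN = 2 also works column-wise, but returns a matrix
m_scaled <- apply(df, 2, normalize)
```

With `lapply()` the result stays a data frame after conversion, which is usually more convenient when the columns are later passed to a model; `apply()` first coerces the data frame to a matrix, so it is only appropriate when all columns are numeric.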

Update to address Holger's suggestion.

If you want to pass additional arguments to min() and max(), e.g., na.rm, then you can use:

normalize <- function(x, ...) {
    return((x - min(x, ...)) /(max(x, ...) - min(x, ...)))
}

x <- c(1, NA, 2, 3)

normalize(x)
# [1] NA NA NA NA

normalize(x, na.rm = TRUE)
# [1] 0.0  NA 0.5 1.0

Keep in mind that whatever you pass to min() via the ellipsis ... is also passed to max(). In this case that should not be a problem, since min() and max() share the same function signature.
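To connect this back to the train/test question, here is a minimal sketch of one way to drop the two for loops while still scaling the test set with the training minimum and range. It assumes both data frames contain only numeric columns, in the same order; the `na.rm` argument and the helper name `scale_df` are additions for illustration, not part of the original function.

```r
# Scale train and test using the minimum and range of the training data only
min_max_scaling <- function(train, test, na.rm = TRUE) {
  min_vals <- sapply(train, min, na.rm = na.rm)
  ranges   <- sapply(train, function(x) diff(range(x, na.rm = na.rm)))

  # Map() pairs each column with its training minimum and range,
  # which replaces the explicit for loops (scale_df is a hypothetical helper)
  scale_df <- function(df) {
    as.data.frame(Map(function(col, lo, rg) (col - lo) / rg,
                      df, min_vals, ranges))
  }

  list(train = scale_df(train), test = scale_df(test))
}

# Made-up example data
train <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))
test  <- data.frame(a = c(2, 4),    b = c(25, 10))

scaled <- min_max_scaling(train, test)
scaled$train
scaled$test
```

Because the test set is scaled with training statistics, its values can fall outside [0, 1] (here `test$a = 4` maps to 1.5), and a constant training column would give a zero range and therefore a division by zero. On the package question, caret's `preProcess()` with `method = "range"`, followed by `predict()` on the test set, appears to follow the same fit-on-train pattern and may be worth a look.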

answered Oct 24 '22 11:10

Somayajulu Evr
