
Min-max scaling/normalization in R for train and test data

Tags:

r

I want to write a function that takes a training set and a test set as its arguments. It should min-max scale (normalize) the training set and return it, then reuse the training set's minimum and range to scale and return the test set.

So far this is the function I have come up with:

min_max_scaling <- function(train, test){

  min_vals <- sapply(train, min)
  range1 <- sapply(train, function(x) diff(range(x)))

  # scale the training data

  train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))

  for(i in seq_len(ncol(train))){
    column <- (train[,i] - min_vals[i])/range1[i]
    train_scaled[i] <- column
  }

  colnames(train_scaled) <- colnames(train)

  # scale the testing data using the min and range of the train data

  test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))

  for(i in seq_len(ncol(test))){
    column <- (test[,i] - min_vals[i])/range1[i]
    test_scaled[i] <- column
  }

  colnames(test_scaled) <- colnames(test)

  return(list(train = train_scaled, test = test_scaled))
}

The definition of min-max scaling is the same as in an earlier SO question, "Normalisation of a two column data using min and max values".

My questions are:
1. Is there a way to vectorize the two for loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?

Jash Shah asked May 18 '17 14:05

1 Answer

Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.

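normalize <- function(x, na.rm = TRUE) {
    # Pass na.rm on to min() and max() so the argument actually has an effect
    return((x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}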

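To apply it to every column of a data frame, use lapply(); to normalize a single column, call the function on that vector directly.

as.data.frame(lapply(df, normalize))
normalize(df$name)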
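As a rough illustration of column-wise use, here is a minimal sketch on a small made-up data frame. The data frame `df` and its column names `height` and `weight` are assumptions for the example, not part of the original answer.

```r
# Min-max normalize one numeric vector, forwarding na.rm to min() and max().
normalize <- function(x, na.rm = TRUE) {
  (x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm))
}

# Hypothetical example data; the column names are illustrative only.
df <- data.frame(height = c(150, 160, 170), weight = c(50, 70, 90))

# lapply() returns one normalized vector per column; rebuild the data frame.
df_scaled <- as.data.frame(lapply(df, normalize))

# A single column is just a vector, so call normalize() on it directly.
height_scaled <- normalize(df$height)
```

Each column is scaled with its own minimum and maximum, so this is suitable for a single data set but not for reusing training statistics on a test set.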

Update to address Holger's suggestion.

If you want to pass additional arguments to min() and max(), e.g., na.rm, then you can use:

normalize <- function(x, ...) {
    return((x - min(x, ...)) /(max(x, ...) - min(x, ...)))
}

x <- c(1, NA, 2, 3)

normalize(x)
# [1] NA NA NA NA

normalize(x, na.rm = TRUE)
# [1] 0.0  NA 0.5 1.0

Keep in mind that whatever you pass to min() via the ellipsis ... is also passed to max(). That is not a problem here, since min() and max() share the same function signature.
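The normalize() function uses each vector's own minimum and maximum, so by itself it does not reuse training-set statistics on a test set, which is what the question asks for. One possible way to vectorize the two loops from the question is base R's scale(), which subtracts a per-column `center` and divides by a per-column `scale`. This is a minimal sketch, assuming both data frames contain the same numeric columns in the same order and that no training column is constant (a zero range would cause division by zero). The example data is made up.

```r
min_max_scaling <- function(train, test) {
  # Statistics come from the training set only.
  min_vals <- sapply(train, min)
  range_vals <- sapply(train, function(x) diff(range(x)))

  # scale() subtracts `center` and divides by `scale`, column by column,
  # so each for loop collapses into a single call.
  train_scaled <- as.data.frame(scale(train, center = min_vals, scale = range_vals))
  test_scaled <- as.data.frame(scale(test, center = min_vals, scale = range_vals))

  list(train = train_scaled, test = test_scaled)
}

# Hypothetical data for illustration.
train <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
test <- data.frame(a = c(2, 4), b = c(15, 40))

scaled <- min_max_scaling(train, test)
```

Because the test set is scaled with the training minimum and range, test values can fall outside [0, 1] whenever they lie outside the training range. As for packages, caret's preProcess() with method = "range", followed by predict() on each data set, is one commonly used option. It is not shown here.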

Somayajulu Evr answered Oct 24 '22 11:10