I am looking to create a function that takes in the training set and the testing set as its arguments, min-max scales/normalizes and returns the training set and uses those same values of minimum and range to min-max scale/normalize and return the test set.
So far this is the function I have come up with:
min_max_scaling <- function(train, test){
min_vals <- sapply(train, min)
range1 <- sapply(train, function(x) diff(range(x)))
# scale the training data
train_scaled <- data.frame(matrix(nrow = nrow(train), ncol = ncol(train)))
for(i in seq_len(ncol(train))){
column <- (train[,i] - min_vals[i])/range1[i]
train_scaled[i] <- column
}
colnames(train_scaled) <- colnames(train)
# scale the testing data using the min and range of the train data
test_scaled <- data.frame(matrix(nrow = nrow(test), ncol = ncol(test)))
for(i in seq_len(ncol(test))){
column <- (test[,i] - min_vals[i])/range1[i]
test_scaled[i] <- column
}
colnames(test_scaled) <- colnames(test)
return(list(train = train_scaled, test = test_scaled))
}
The definition of min max scaling is similar to this question asked earlier on SO - Normalisation of a two column data using min and max values
My questions are:
1. Is there a way to vectorize the two for
loops in the function? e.g. using sapply()
2. Are there any packages that allow us to do what we are looking to do here?
Normalize Data with Min-Max Scaling in R With Min-Max Scaling, we scale the data values between a range of 0 to 1 only. Due to this, the effect of outliers on the data values suppresses to a certain extent. Moreover, it helps us have a smaller value of the standard deviation of the data scale.
Normalization (Min-Max Scalar) Normalization is useful in models such as k-nearest neighbors and artificial neural networks, or anywhere where the data we are using has varying scales or precision (this will be more clear in the example below).
A Min-Max scaling is typically done via the following equation: Xsc=X−XminXmax−Xmin. One family of algorithms that is scale-invariant encompasses tree-based learning algorithms.
Here is the code for the min-max normalization. See this Wikipedia page for the formulae, and also other ways of performing feature scaling.
normalize <- function(x, na.rm = TRUE) {
return((x- min(x)) /(max(x)-min(x)))
}
To get a vector, use apply
instead of lapply
.
as.data.frame(apply(df$name, normalize))
Update to address Holger's suggestion.
If you want to pass additional arguments to min()
and max()
, e.g., na.rm
, then you can use:
normalize <- function(x, ...) {
return((x - min(x, ...)) /(max(x, ...) - min(x, ...)))
}
x <- c(1, NA, 2, 3)
normalize(a)
# [1] NA NA NA NA
normalize(a, na.rm = TRUE)
# 0.0 NA 0.5 1.0
Just keep in mind, that whatever you pass to min()
via the ellipsis ...
you also implicitly pass to max()
. In this case, this shouldn't be a big problem since both min()
and max()
share the same function signature.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With