In Python, scikit-learn has a great class called LabelEncoder that maps categorical levels (strings) to an integer representation.
Is there anything in R that does this? For example, if there is a variable called color with values {'Blue','Red','Green'}, the encoder would translate:
Blue => 1
Green => 2
Red => 3
and create an object with this mapping to then use for transforming new data in a similar fashion.
Edit: Plain factors don't seem to work on their own because the mapping is not persisted. If the new data contains a level that was not in the training data, the whole mapping changes. Ideally, I would like new levels to be labeled as missing or 'other' somehow.
sample_dat <- data.frame(a_str=c('Red','Blue','Blue','Red','Green'))
sample_dat$a_int<-as.integer(as.factor(sample_dat$a_str))
sample_dat$a_int
#[1] 3 1 1 3 2
sample_dat2 <- data.frame(a_str=c('Red','Blue','Blue','Red','Green','Azure'))
sample_dat2$a_int<-as.integer(as.factor(sample_dat2$a_str))
sample_dat2$a_int
# [1] 4 2 2 4 3 1
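One way to persist the mapping with plain factors is to store the level set from the training data and pass it back through the levels argument of factor(). The following is only a minimal sketch: the names train_levels and encode_with_other are made up for illustration, and it assumes the training data does not already contain a level called "Other".

```r
train <- c("Red", "Blue", "Blue", "Red", "Green")
new   <- c("Red", "Blue", "Blue", "Red", "Green", "Azure")

# Levels learned once from the training data; this vector is the persisted mapping
train_levels <- sort(unique(train))

# Reusing the stored levels keeps the codes stable; unseen values become NA
new_int <- as.integer(factor(new, levels = train_levels))
new_int
# [1]  3  1  1  3  2 NA

# Hypothetical helper: send unseen values to an explicit "Other" level instead
# (assumes "Other" is not already one of the training levels)
encode_with_other <- function(x, levels, other = "Other") {
  x <- as.character(x)
  x[!is.na(x) & !(x %in% levels)] <- other
  as.integer(factor(x, levels = c(levels, other)))
}
new_int_other <- encode_with_other(new, train_levels)
new_int_other
# [1] 3 1 1 3 2 4
```

Because train_levels is an ordinary character vector, it can be saved with saveRDS() and reloaded later to encode new data the same way.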
Often in machine learning, we want to convert categorical variables into some type of numeric format that can be readily used by algorithms. One way to do this is through label encoding, which assigns each categorical value an integer value based on alphabetical order.
Label encoding converts labels into numeric form so that they are machine-readable. Machine learning algorithms can then operate on those labels directly. It is an important preprocessing step for structured datasets in supervised learning.
Label Encoder: scikit-learn provides an efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. A label that repeats is assigned the same value it received earlier.
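Note that R factor codes start at 1, while the scikit-learn codes described above start at 0. If zero-based codes are needed in R, subtracting one from the factor codes is one option. A small sketch (the variable names are illustrative only):

```r
labels <- c("red", "red", "blue", "green")
f <- factor(labels)

# R factor codes are 1-based: levels are blue, green, red
r_codes <- as.integer(f)
r_codes
# [1] 3 3 1 2

# Shift by one to get codes between 0 and n_classes - 1
zero_based <- as.integer(f) - 1L
zero_based
# [1] 2 2 0 1
```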
Ordinal encoding looks very similar to label encoding. The difference is that label encoding does not consider whether a variable is ordinal, whereas ordinal encoding assigns a sequence of numeric values that follows the order of the categories.
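In R, the distinction can be illustrated with an ordered factor. This is a sketch with made-up data, assuming the intended ranking is low < medium < high:

```r
sizes <- c("medium", "low", "high", "low")

# Default factor: levels are sorted alphabetically (high, low, medium),
# so the integer codes ignore the natural ranking
label_codes <- as.integer(factor(sizes))
label_codes
# [1] 3 2 1 2

# Ordered factor: the levels are given explicitly in rank order
size_order <- c("low", "medium", "high")
ordinal <- factor(sizes, levels = size_order, ordered = TRUE)
ordinal_codes <- as.integer(ordinal)
ordinal_codes
# [1] 2 1 3 1

# Ordered factors also support comparisons
ordinal[1] > ordinal[2]
# [1] TRUE
```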
Create your vector of data:
colors <- c("red", "red", "blue", "green")
Create a factor:
factors <- factor(colors)
Convert the factor to numbers:
as.numeric(factors)
Output: (note that this is in alphabetical order)
# [1] 3 3 1 2
You can also set a custom numbering system: (note that the output now follows the "rainbow color order" that I defined)
rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4
See ?factor.
Try the CatEncoders package. It replicates the Python sklearn.preprocessing functionality.
library(CatEncoders)
# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)
# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values
# [1] 3 3 NA
# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA
I would add the functionality of reporting the non-matching labels with a warning.
PS: It also has the OneHotEncoder function.
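The warning for non-matching labels suggested above could be sketched in base R as follows. This is not part of CatEncoders; transform_with_warning is a hypothetical helper built on match(), and it assumes the known labels are stored as a sorted character vector.

```r
# Hypothetical helper, not a CatEncoders function:
# encodes x against known levels and warns about labels it has not seen
transform_with_warning <- function(x, levels) {
  codes <- match(x, levels)
  unseen <- unique(x[is.na(codes) & !is.na(x)])
  if (length(unseen) > 0) {
    warning("Unseen labels encoded as NA: ", paste(unseen, collapse = ", "))
  }
  codes
}

# Known labels taken from the training vector
known <- sort(unique(c("red", "red", "blue", "green")))

# "yellow" was never seen, so it becomes NA and triggers a warning
codes <- transform_with_warning(c("red", "red", "yellow"), known)
codes
# [1]  3  3 NA
```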
If I understand correctly what you want:
# function which returns a function that encodes vectors using the values of 'vec'
label_encoder = function(vec){
    levels = sort(unique(vec))
    function(x){
        match(x, levels)
    }
}
colors = c("red", "red", "blue", "green")
color_encoder = label_encoder(colors) # create encoder
encoded_colors = color_encoder(colors) # encode colors
encoded_colors
new_colors = c("blue", "green", "green") # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors
other_colors = c("blue", "green", "green", "yellow")
color_encoder(other_colors) # NA's are introduced
# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result
# dealing with multiple columns
# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
    matrix(
        sample(c("red", "blue", "green", "yellow"), 12, replace = TRUE),
        ncol = 3)
)
color_dataframe
# encode each column
for (column in colnames(color_dataframe)){
    color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
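The loop above applies one shared encoder to every column. When the columns hold different kinds of categories, one encoder per column may be more appropriate. A possible sketch, where the train and new_data frames are made-up examples and label_encoder is repeated so the block runs on its own:

```r
# Same closure-based encoder as above
label_encoder <- function(vec) {
  levels <- sort(unique(vec))
  function(x) match(x, levels)
}

# Made-up training data with two unrelated categorical columns
train <- data.frame(
  color = c("red", "blue", "green"),
  size  = c("small", "large", "small"),
  stringsAsFactors = FALSE
)

# Fit one encoder per column and keep them in a named list
encoders <- lapply(train, label_encoder)

# Made-up new data; "yellow" was not seen during fitting
new_data <- data.frame(
  color = c("green", "yellow"),
  size  = c("large", "small"),
  stringsAsFactors = FALSE
)

# Apply each stored encoder to the column with the same name
encoded <- new_data
for (column in names(encoders)) {
  encoded[[column]] <- encoders[[column]](new_data[[column]])
}
encoded
#   color size
# 1     2    1
# 2    NA    2
```

The encoders list can be written to disk with saveRDS() in the same way as the single encoder.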