
Label Encoder functionality in R?

Tags:

r

In Python, scikit-learn has a great class called LabelEncoder that maps categorical levels (strings) to an integer representation.

Is there anything in R that does this? For example, if there is a variable called color with values {'Blue','Red','Green'}, the encoder would translate:

Blue => 1
Green => 2
Red => 3

and create an object holding this mapping, which can then be used to transform new data in the same way.

Add: It doesn't seem like factors alone will work, because the mapping is not persisted. If the new data contains a level that was not in the training data, the whole mapping shifts. Ideally, I would like new levels to be labeled as missing or 'other' somehow.

sample_dat <- data.frame(a_str = c('Red', 'Blue', 'Blue', 'Red', 'Green'))
sample_dat$a_int <- as.integer(as.factor(sample_dat$a_str))
sample_dat$a_int
# [1] 3 1 1 3 2
sample_dat2 <- data.frame(a_str = c('Red', 'Blue', 'Blue', 'Red', 'Green', 'Azure'))
sample_dat2$a_int <- as.integer(as.factor(sample_dat2$a_str))
sample_dat2$a_int
# [1] 4 2 2 4 3 1
B_Miner asked Jul 27 '16 18:07



3 Answers

Create your vector of data:

colors <- c("red", "red", "blue", "green")

Create a factor:

factors <- factor(colors)

Convert the factor to numbers:

as.numeric(factors)

Output: (note that this is in alphabetical order)

# [1] 3 3 1 2

You can also set a custom numbering system: (note that the output now follows the "rainbow color order" that I defined)

rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4

See ?factor.
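
To address the persistence concern from the question, one option is to store the levels seen in the training data and pass them explicitly when encoding new data. This is only a sketch and not part of the original answer; the names train_colors, train_levels, new_colors and new_codes are made up for illustration. Values outside the stored levels become NA instead of shifting the mapping.

```r
# Levels learned once from the training data (sorted alphabetically by factor()).
train_colors <- c("Red", "Blue", "Blue", "Red", "Green")
train_levels <- levels(factor(train_colors))   # "Blue" "Green" "Red"

# New data with a value ("Azure") that was not in the training data.
new_colors <- c("Red", "Blue", "Azure")

# Reusing the stored levels keeps Red = 3 and Blue = 1; "Azure" becomes NA.
new_codes <- as.integer(factor(new_colors, levels = train_levels))
new_codes
# [1]  3  1 NA
```

Saving train_levels (for example with saveRDS) is enough to reproduce the same encoding later.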

tluh answered Sep 20 '22 13:09


Try the CatEncoders package. It replicates the Python sklearn.preprocessing functionality.

library(CatEncoders)

# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)

# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values

# [1]  3  3 NA


# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA   

I would add the functionality of reporting the non-matching labels with a warning.
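
The warning idea mentioned above could look roughly like this in base R. This is a sketch that does not use CatEncoders; encode_with_warning and known_levels are hypothetical names, and the encoding itself is a plain match() against the sorted unique training values.

```r
# Hypothetical helper: encode x against known levels and warn about unseen labels.
encode_with_warning <- function(x, known_levels) {
    codes <- match(x, known_levels)
    # labels that are present in x but missing from known_levels
    unseen <- unique(x[is.na(codes) & !is.na(x)])
    if (length(unseen) > 0) {
        warning("Unseen labels mapped to NA: ", paste(unseen, collapse = ", "))
    }
    codes
}

known_levels <- sort(unique(c("red", "red", "blue", "green")))
codes <- encode_with_warning(c("red", "red", "yellow"), known_levels)  # warns about "yellow"
codes
# [1]  3  3 NA
```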

PS: It also has the OneHotEncoder function.

Pablo Casas answered Sep 19 '22 13:09


If I understand correctly what you want:

# returns a function that encodes vectors using the sorted unique values of 'vec'
label_encoder = function(vec){
    levels = sort(unique(vec))
    function(x){
        match(x, levels)
    }
}

colors = c("red", "red", "blue", "green")

color_encoder = label_encoder(colors) # create encoder

encoded_colors = color_encoder(colors) # encode colors
encoded_colors
# [1] 3 3 1 2

new_colors = c("blue", "green", "green")  # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors
# [1] 1 2 2

other_colors = c("blue", "green", "green", "yellow")
color_encoder(other_colors) # unseen values ("yellow") become NA
# [1]  1  2  2 NA

# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result

# dealing with multiple columns

# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
    matrix(
        sample(c("red", "blue", "green",  "yellow"), 12, replace = TRUE),
        ncol = 3)
)
color_dataframe

# encode each column ("yellow" was not seen when the encoder was created, so it becomes NA)
for (column in colnames(color_dataframe)){
    color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
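
The question also asks for unseen levels to be labeled 'other' rather than NA. A possible variant of the closure above is sketched here; label_encoder_other, other_label, transform and inverse are hypothetical names chosen for this example. Unseen values share one extra code after the known levels, and real NAs stay NA.

```r
# Hypothetical variant of label_encoder: unseen values share one extra "other" code.
label_encoder_other <- function(vec, other_label = "other") {
    known <- sort(unique(vec))
    all_labels <- c(known, other_label)
    list(
        # known values get 1..length(known); unseen non-NA values get the last code
        transform = function(x) {
            codes <- match(x, known)
            codes[is.na(codes) & !is.na(x)] <- length(all_labels)
            codes
        },
        # map codes back to labels; the extra code comes back as other_label
        inverse = function(codes) all_labels[codes]
    )
}

enc <- label_encoder_other(c("red", "red", "blue", "green"))
codes <- enc$transform(c("blue", "yellow", "red", NA))
codes
# [1]  1  4  3 NA
enc$inverse(codes)
# [1] "blue"  "other" "red"   NA
```

As with the original closure, the returned list can be stored with saveRDS and read back with readRDS.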
Gregory Demin answered Sep 18 '22 13:09