Multi label encoding for classes with duplicates

How can I n-hot encode a column of lists that contain duplicates?

I am looking for something like sklearn's MultiLabelBinarizer, but one that counts how many times each class occurs in a row instead of binarizing.

Example input:

x = pd.Series([['a', 'b', 'a'], ['b', 'c'], ['c','c']])

Expected output:

    a   b   c
0   2   1   0
1   0   1   1
2   0   0   2

brandoldperson asked Aug 06 '19 07:08


1 Answer

I wrote a new class, MultiLabelCounter, based on the MultiLabelBinarizer code. It counts the occurrences of each label per row instead of binarizing them.

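import itertools


class MultiLabelCounter:
    """Like MultiLabelBinarizer, but counts label occurrences per row."""

    def __init__(self, classes=None):
        # Optional fixed label order; if None, it is inferred in fit().
        self.classes = classes

    def fit(self, y):
        if self.classes is None:
            # Collect every distinct label across all rows, in sorted order.
            self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        else:
            self.classes_ = list(self.classes)
        # Map each label to its column index.
        self.mapping = dict(zip(self.classes_, range(len(self.classes_))))
        return self

    def transform(self, y):
        yt = []
        for labels in y:
            data = [0] * len(self.classes_)
            for label in labels:
                # Raises KeyError for a label that was not seen in fit().
                data[self.mapping[label]] += 1
            yt.append(data)
        return yt

    def fit_transform(self, y):
        return self.fit(y).transform(y)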

import pandas as pd

x = pd.Series([['a', 'b', 'a'], ['b', 'c'], ['c', 'c']])

mlc = MultiLabelCounter()
mlc.fit_transform(x)

# [[2, 1, 0], [0, 1, 1], [0, 0, 2]]
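
If you need the DataFrame shown in the question rather than a list of lists, one option is to count the labels in each row with collections.Counter and let pandas align the keys into columns. This is only a minimal sketch and not part of the class above; the variable name counts is illustrative.

```python
from collections import Counter

import pandas as pd

x = pd.Series([['a', 'b', 'a'], ['b', 'c'], ['c', 'c']])

# One dict of label -> count per row; pandas aligns the keys into columns.
counts = pd.DataFrame([dict(Counter(labels)) for labels in x], index=x.index)

# Labels missing from a row come out as NaN, so fill with 0 and restore ints.
# Sorting the columns gives a stable, alphabetical label order.
counts = counts.fillna(0).astype(int).sort_index(axis=1)

print(counts)
```

With the class above, pd.DataFrame(mlc.fit_transform(x), columns=mlc.classes_) should give the same table.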

Venkatachalam answered Sep 30 '22 20:09