
Getting cardinality from ordinal encoding in Scikit-learn

I'm using the OrdinalEncoder to encode categorical data in Scikit-learn and I'm looking for a way to get details about the encoding, i.e. the cardinality of each feature or even the exact mapping between the numbers and categories.

Short of the inverse_transform method I can't see a way of doing this. I want to do this as generally as possible, i.e. without knowing the categories in advance.

I'm aware of the issues with ordinal encoding (one-hot encoding is not an option for me). I've also looked at DictVectorizer, but I'm not sure whether it is appropriate.

robtherobot101 asked Mar 25 '26 20:03


2 Answers

Okay, so I recreated the example from the official documentation:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

Now, if you want to see the encoding, simply read the categories_ attribute of the fitted encoder, so in this case:

print(enc.categories_)
# Output: [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

Now, this only returns the categories of each feature, not the numbers they are encoded to. However, the index of each category within its array is its encoding. For example, in this case Female is encoded as 0 and Male is encoded as 1; moving on to the next feature, 1 is encoded as 0, 2 is encoded as 1, and so on. The length of each array is therefore the cardinality of that feature.

So, if I want to get the encoding of Female and Male:

encoding = enc.categories_
encoding_sex = dict(zip((encoding[0]), range(len(encoding[0]))))
print(encoding_sex)
# Output: {'Female': 0, 'Male': 1}

To generalize the above to all features, build one dictionary per feature:

encoding = enc.categories_
# One {category: code} dict per feature; a category's code is its index in the array.
encoding_full = [
    {category: code for code, category in enumerate(categories)}
    for categories in encoding
]
print(encoding_full)
# Output: [{'Female': 0, 'Male': 1}, {1: 0, 2: 1, 3: 2}]
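Since the question also asks about cardinality, here is a minimal sketch of how it could be read off the same attribute. It assumes the same toy data as above; the names cardinalities and codes are only illustrative. The last lines check that the code produced by transform matches the category's index.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc = OrdinalEncoder().fit(X)

# Cardinality of a feature = number of categories learned for that column.
cardinalities = [len(categories) for categories in enc.categories_]
print(cardinalities)
# Output: [2, 3]

# Sanity check: transform() assigns each category its index in categories_.
codes = enc.transform([['Male', 3]])
print(codes.tolist())
# Output: [[1.0, 2.0]]
```

Nothing here depends on knowing the categories in advance; everything is read from the fitted encoder.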
Gambit1614 answered Mar 27 '26 09:03


categories_ does hold the mapping; inverse_transform relies on it, as you can see in the inverse_transform source code.

If you are looking for an explicit dictionary between numbers and categories, use:

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
... 
OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)

>>> [dict(enumerate(mapping)) for mapping in enc.categories_]
# [{0: 'Female', 1: 'Male'}, {0: 1, 1: 2, 2: 3}]
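As a rough check that these dictionaries really describe the encoding, the sketch below decodes transformed rows by hand and compares the result with the input. It assumes the same toy data; code_to_category and decoded are illustrative names, not part of scikit-learn.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc = OrdinalEncoder().fit(X)

# One {code: category} dict per feature, as in the snippet above.
code_to_category = [dict(enumerate(mapping)) for mapping in enc.categories_]

# transform() returns floats, so cast each code to int before the lookup.
encoded = enc.transform(X)
decoded = [
    [code_to_category[j][int(code)] for j, code in enumerate(row)]
    for row in encoded
]
print(decoded == X)
# Output: True
```

In practice inverse_transform does this for you; the manual lookup is only there to show that the dictionaries carry the same information.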
Venkatachalam answered Mar 27 '26 09:03


