Getting cardinality from ordinal encoding in Scikit-learn

Question

I'm using OrdinalEncoder to encode categorical data in scikit-learn and I'm looking for a way to get details about the encoding, i.e. the cardinality of each feature or even the exact mapping between the numbers and the categories.

Short of the inverse_transform method I can't see a way of doing this. I want to do this as generally as possible, i.e. without knowing the categories in advance.

I'm aware of the issues with ordinal encoding (onehot is not an option for me). I've also looked at DictVectorizer but I am not sure whether it is appropriate.

Gambit1614 · Accepted Answer

Okay, so I recreated the official documentation example,

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

Now, if you want to see the encoding, read the categories_ attribute of the fitted encoder (it is an attribute, not a method), so in this case:

print(enc.categories_)
#Output: [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

This returns only the categories found for each feature, not the integer codes. However, the position of each category in its array is its code. In this case, Female is encoded as 0 and Male as 1; for the second feature, 1 is encoded as 0, 2 as 1, and 3 as 2.

So, if I want to get the encoding of Female and Male:

encoding = enc.categories_
encoding_sex = dict(zip(encoding[0], range(len(encoding[0]))))
print(encoding_sex)
# Output: {'Female': 0, 'Male': 1}

To generalize this to all features, apply the same construction to each array in categories_:

encoding = enc.categories_
encoding_feature = lambda x: dict(zip(x, range(len(x))))
encoding_full = [encoding_feature(feature_elem) for feature_elem in encoding]
print(encoding_full)
# Output: [{'Female': 0, 'Male': 1}, {1: 0, 2: 1, 3: 2}]
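The question also asks for the cardinality of each feature. A minimal sketch, assuming the same toy data as above and default encoder settings: the cardinality is the length of each array in categories_. The name cardinalities is only illustrative.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc = OrdinalEncoder().fit(X)

# One entry per feature: the number of distinct categories seen during fit.
cardinalities = [len(cats) for cats in enc.categories_]
print(cardinalities)
# Output: [2, 3]
```

This needs no prior knowledge of the categories, since categories_ is learned from the data at fit time.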

Venkatachalam · Answer

categories_ does hold the mapping; inverse_transform relies on it. You can have a look at the inverse_transform code in the scikit-learn source.

Maybe you are looking for an explicit dictionary between numbers and categories; in that case, use

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
... 
OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)

>>> [dict(enumerate(mapping)) for mapping in enc.categories_]
# [{0: 'Female', 1: 'Male'}, {0: 1, 1: 2, 2: 3}]
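To check that the position in categories_ really is the code, one option is to transform the data and look each code up by index. This is a small sketch on the same toy data, assuming default settings with no missing or unknown values; the names codes and decoded are only illustrative.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc = OrdinalEncoder().fit(X)

# transform returns a float array with one column of codes per feature.
codes = enc.transform(X)

# Use each code as an index into the categories_ array of its feature.
decoded = [
    [enc.categories_[j][int(code)] for j, code in enumerate(row)]
    for row in codes
]
print(decoded == X)
# Output: True
```

This is the same lookup that the dictionaries above make explicit.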

Tags: encoding, scikit-learn, categorical-data

robtherobot101
