How to handle unseen categorical values in a test data set using Python?

Question

Suppose I have a location feature. In the training set its unique values are 'NewYork' and 'Chicago', but the test set contains 'NewYork', 'Chicago', and 'London'. When creating the one-hot encoding, how can I ignore 'London'? In other words, how do I avoid encoding categories that appear only in the test set?

kevin_theinfinityfund · Accepted Answer

You rarely want to throw information away. It is usually better to decide up front how your model's preprocessing will handle such values. For example, you might have some data with NaN values:

import numpy as np
train_data = ['NewYork', 'Chicago', np.nan]

Solution 1

You will likely already have a way of dealing with this. Whether you impute, delete, etc. is up to you and depends on the problem. More often than not you can let NaN be its own category, since missingness is information as well. Something like this can suffice:

# Replace NaN in the given categorical columns with the label 'Missing'
def fill_categorical_na(df, var_list):
    X = df.copy()  # leave the caller's DataFrame untouched
    X[var_list] = df[var_list].fillna('Missing')
    return X

# vars_with_na: list of categorical column names that contain NaN
# (assumed to be defined beforehand)
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)

Then, when you move to production, you can write a script that pushes unseen categories into the "Missing" category you established earlier.
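A minimal sketch of such a script, assuming pandas Series as input. The helper name map_unseen_to_missing and the toy data are hypothetical, not part of the original answer; the idea is simply to remember the labels seen in training and replace everything else (including NaN) with 'Missing'.

```python
import pandas as pd

def map_unseen_to_missing(train_col, new_col, fill_label="Missing"):
    """Replace NaN and any label not seen in training with fill_label.

    Hypothetical helper for illustration only.
    """
    # Labels known at training time (NaN is excluded by dropna)
    known = set(train_col.dropna().unique())
    # where() keeps a value when the condition is True, else uses fill_label
    return new_col.where(new_col.isin(known), fill_label)

# Toy data: 'London' never appears in training
train = pd.Series(["NewYork", "Chicago", None])
test = pd.Series(["NewYork", "London", None, "Chicago"])

train_clean = train.fillna("Missing")
test_clean = map_unseen_to_missing(train, test)
print(test_clean.tolist())
```

The set of known labels is taken from the training data only, so the same mapping can be reused unchanged on any later batch.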

Solution 2

If you're not satisfied with that idea, you could instead group these unusual cases into a new category that we'll call "Rare", because they do not occur often.

import numpy as np

train_data = ['NewYork', 'Chicago', 'NewYork', 'Chicago', 'London']

# capture the categorical (object-dtype) variables first
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

def find_frequent_labels(df, var, rare_perc):
    # share of rows per label; assumes df still contains a
    # 'Target_Variable' column to count against
    tmp = df.groupby(var)['Target_Variable'].count() / len(df)
    return tmp[tmp > rare_perc].index

for var in cat_vars:
    # learn the frequent labels on the training set only
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')

Now, given enough instances of the "normal" categories, London will get pushed into the "Rare" category. However many new categories show up later, they will all be grouped into 'Rare', provided they remain rare and do not become dominant categories.
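To make the effect concrete, here is a small self-contained sketch of the same idea on toy data. The column name location, the 5% threshold, and the label 'Paris' are assumptions chosen for illustration, and the frequency is computed with value_counts rather than a target column.

```python
import numpy as np
import pandas as pd

# Toy data: 'London' is rare in training; 'Paris' appears only in the test set
X_train = pd.DataFrame(
    {"location": ["NewYork"] * 60 + ["Chicago"] * 39 + ["London"]}
)
X_test = pd.DataFrame({"location": ["NewYork", "London", "Paris", "Chicago"]})

def frequent_labels(df, var, rare_perc):
    # Share of rows per label; keep labels above the threshold
    shares = df[var].value_counts() / len(df)
    return shares[shares > rare_perc].index

# Learn the frequent labels from the training set only
frequent = frequent_labels(X_train, "location", 0.05)

# Apply the same mapping to both sets: anything not frequent becomes 'Rare'
for df in (X_train, X_test):
    df["location"] = np.where(df["location"].isin(frequent), df["location"], "Rare")

print(X_test["location"].tolist())
```

Both the rare training label and the label that was never seen in training end up in 'Rare', so the encoder fitted afterwards only ever sees labels it knows.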

devansh · Answer

You can use the handle_unknown parameter of scikit-learn's OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')

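With this setting, a category that was not seen during fit does not raise an error at transform time. Instead, all one-hot columns for that feature are set to zero for that row.

See the documentation for more: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html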
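As a rough illustration with the data from the question, the sketch below fits the encoder on the training values and transforms a test set that also contains 'London'. The DataFrame and column names are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data has two locations; the test data adds an unseen one
X_train = pd.DataFrame({"location": ["NewYork", "Chicago"]})
X_test = pd.DataFrame({"location": ["NewYork", "Chicago", "London"]})

# handle_unknown='ignore' avoids an error on categories not seen during fit
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train)

# transform returns a sparse matrix by default; convert it for inspection
encoded = ohe.transform(X_test).toarray()
print(ohe.categories_)
print(encoded)
```

The learned categories are sorted, so the columns correspond to 'Chicago' and 'NewYork', and the 'London' row comes out as all zeros.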

Tags: python, machine-learning, one-hot-encoding, categorical-data, feature-extraction

Neo
