 

How to handle unseen categorical values in a test data set using Python?

Suppose I have a location feature. In the training set its unique values are 'NewYork' and 'Chicago', but in the test set they are 'NewYork', 'Chicago', and 'London'. When creating a one-hot encoding, how do I ignore 'London'? In other words, how do I avoid encoding categories that appear only in the test set?

Neo asked Jan 19 '17 04:01



2 Answers

Often you don't want to throw information away; you want to account for it within your model instead. For example, you might have some data with NaN values:

train_data = ['NewYork', 'Chicago', float('nan')]

Solution 1

You will likely have a way of dealing with this; whether you impute, delete, etc. is up to you and depends on the problem. More often than not, you can let NaN be its own category, since missingness is information as well. Something like this can suffice:

# function to replace NA in categorical variables
def fill_categorical_na(df, var_list):
  X = df.copy()
  X[var_list] = df[var_list].fillna('Missing')
  return X

# replace missing values with new label: "Missing"
# (vars_with_na is the list of categorical columns that contain NaN)
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)

Therefore, when you move to production, you could write a script that pushes unseen categories into the "Missing" category you established earlier.
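Such a script could look roughly like the sketch below. It is only an illustration: the helper name map_unseen_to_missing and the toy series are assumptions, not part of the answer above, and it treats every label that was absent from the training column (including NaN) as "Missing".

```python
import pandas as pd

def map_unseen_to_missing(train_col, new_col, missing_label="Missing"):
    """Replace NaN and any label not seen in training with missing_label."""
    # labels actually observed in the training column (NaN excluded)
    known = train_col.dropna().unique()
    # keep values that were seen in training, replace everything else
    return new_col.where(new_col.isin(known), missing_label)

# toy data, made up for illustration
train = pd.Series(["NewYork", "Chicago", None])
test = pd.Series(["NewYork", "Chicago", "London", None])

test_filled = map_unseen_to_missing(train, test)
print(test_filled.tolist())
```

Here 'London' and the missing value both end up under the same "Missing" label, so the encoder fitted on the training data never meets a label it does not know.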

Solution 2

If you're not satisfied with that idea, you could always turn these unusual cases into a new unique category that we'll call "rare" because it's not present often.

train_data = ['NewYork', 'Chicago', 'NewYork', 'Chicago', 'London']

import numpy as np

# let's capture the categorical variables first
cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

def find_frequent_labels(df, var, rare_perc):
  # share of rows carrying each label; keep the labels above the threshold
  tmp = df[var].value_counts() / len(df)
  return tmp[tmp > rare_perc].index

for var in cat_vars:
  # learn the frequent labels from the training set only
  frequent_ls = find_frequent_labels(X_train, var, 0.01)
  X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
  X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')

Now, given enough instances of the "normal" categories, London will get pushed into the 'Rare' category. Because the frequent labels are learned from the training set only, any label that appears just in the test set also falls into 'Rare'. However many new categories show up, they are grouped into 'Rare', provided they remain rare and don't become dominant categories.

kevin_theinfinityfund answered Nov 01 '22 12:11



You can use the handle_unknown parameter of scikit-learn's OneHotEncoder.

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')

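With this setting, transform does not raise an error when it meets a category it did not see during fit; the one-hot columns for that feature are simply all zeros for that row.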
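As a minimal sketch using the locations from the question (the arrays X_train and X_test here are made-up toy data), fitting on the training values and transforming the test values looks like this:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy data: 'London' appears only in the test set
X_train = np.array([["NewYork"], ["Chicago"], ["NewYork"]])
X_test = np.array([["NewYork"], ["Chicago"], ["London"]])

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train)  # categories are learned from the training data only

# transform returns a sparse matrix by default; densify it for printing
encoded = ohe.transform(X_test).toarray()
print(ohe.categories_)  # learned categories, sorted: Chicago, NewYork
print(encoded)          # the 'London' row is [0, 0]
```

The encoder keeps only the two columns learned from the training set, so 'London' gets no column of its own and its row contains no 1.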

See the documentation for more: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

devansh answered Nov 01 '22 10:11
