Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Patsy: New levels in categorical fields in test data

I am trying to use Patsy (with sklearn, pandas) for creating a simple regression model. The R style formula creation is a major draw.

My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:

df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')

The last line throws the following error:

patsy.PatsyError: Error converting data to categorical: observation with value 'Kolkata' does not match any of the expected levels

I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.

Is there any way I can make this work with Patsy?

like image 957
DaSarfyCode Avatar asked Dec 02 '15 06:12

DaSarfyCode


1 Answers

The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.

One way is by using the levels= argument to C(...), like:

# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))

dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)

Another option if you're using pandas's default categorical support is to record the set of possible values when you set up your data frame; if patsy detects that the object you've passed it is a pandas categorical then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.

like image 75
Nathaniel J. Smith Avatar answered Sep 18 '22 14:09

Nathaniel J. Smith