
How to persist patsy DesignInfo?

I'm working on an application that is a "predictive-model-as-a-service", structured as follows:

  • train a model offline
  • periodically upload model parameters to a "prediction server"
  • the prediction server takes as input a single observation, and outputs a prediction

I'm trying to use patsy, but running into the following problem: When a single prediction comes in, how do I convert it to the right shape such that it looks like a row of the training data?

The patsy documentation provides an example when the DesignInfo from the training data is available in memory: http://patsy.readthedocs.io/en/latest/library-developers.html#predictions

    # offline model training
    import patsy

    data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
    eq_string = "cuteness ~ animal"

    dmats = patsy.dmatrices(eq_string, data)
    design_info = dmats[1].design_info
    trained_model = train_model(dmats)

    # online predictions
    input_data = {'animal': ['raccoon']}

    # if the DesignInfo were available, I could do this:
    new_dmat = patsy.build_design_matrices([design_info], input_data)
    make_prediction(new_dmat, trained_model)

And then the output:

    [DesignMatrix with shape (1, 3)
       Intercept  animal[T.dog]  animal[T.raccoon]
               1              0                  1
       Terms:
         'Intercept' (column 0)
         'animal' (columns 1:3)]

Notice that this row is the same shape as the training data; it has a column for animal[T.dog]. In my application, I don't have a way to access the DesignInfo to build the DesignMatrix for the new data. Concretely, how would the prediction server know how many other categories of animal are in the training data and in what order?

I thought I could just pickle the DesignInfo, but it turns out this isn't supported yet: https://github.com/pydata/patsy/issues/26

I could also simply persist the matrix columns as a string and rebuild the matrix from that online, but this seems a bit fragile.
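For illustration, here is a minimal sketch of that hand-rolled approach, assuming treatment (dummy) coding with an intercept as in the `cuteness ~ animal` example. Instead of parsing column-name strings, it persists the ordered category levels as JSON. The helpers `dump_levels` and `encode_row` are hypothetical names and are not part of patsy.

```python
import json


def dump_levels(levels):
    """Serialize the ordered category levels seen during training."""
    return json.dumps(levels)


def encode_row(levels_json, observation):
    """Build one treatment-coded row: an intercept, then one dummy
    column per non-reference level of each categorical factor."""
    levels = json.loads(levels_json)
    row = [1.0]  # intercept column
    for factor, categories in levels.items():
        value = observation[factor]
        if value not in categories:
            raise ValueError(f"unseen level {value!r} for factor {factor!r}")
        # The first category is the reference level and gets no column.
        row.extend(1.0 if value == c else 0.0 for c in categories[1:])
    return row


# offline: record the levels in sorted order
levels_json = dump_levels({"animal": ["cat", "dog", "raccoon"]})

# online: encode a single incoming observation
row = encode_row(levels_json, {"animal": "raccoon"})
```

The fragility is visible here: anything beyond a plain categorical term (interactions, transforms, a different contrast coding) would need its own hand-written encoding that has to stay in sync with the training formula.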

Is there a good way to do this?

exp1orer asked Apr 30 '16 18:04





1 Answer

Assuming your goal is to be able to restart the server without retraining, it looks like your best option (until patsy implements pickling) would be to pickle data, eq_string and whatever parameters are calculated by train_model. Then upon restarting the server, you could unpickle data and eq_string and call dmats = patsy.dmatrices(eq_string,data) again. This should run pretty fast, since it's not really training a model, just preprocessing your data. Then you would also unpickle the parameters calculated by train_model (not shown in the question), and the server should be ready to make predictions for new inputs.
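A rough sketch of that workflow, using the data and formula from the question. Since `train_model` is not shown, an ordinary least-squares fit via `numpy.linalg.lstsq` stands in for it here; that choice is an assumption, as is keeping the pickled state in an in-memory `bytes` object rather than writing it to a file.

```python
import pickle

import numpy as np
import patsy

# --- offline: train, then persist the raw inputs (not the DesignInfo) ---
data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
eq_string = "cuteness ~ animal"
y, X = patsy.dmatrices(eq_string, data)

# Stand-in for train_model (assumption): ordinary least squares.
params, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)

# In practice this blob would be written to disk or uploaded to the server.
blob = pickle.dumps({"data": data, "eq_string": eq_string, "params": params})

# --- server start-up: rebuild the DesignInfo from the persisted inputs ---
state = pickle.loads(blob)
_, X_rebuilt = patsy.dmatrices(state["eq_string"], state["data"])
design_info = X_rebuilt.design_info

# --- per request: encode the new observation and predict ---
input_data = {'animal': ['raccoon']}
(new_X,) = patsy.build_design_matrices([design_info], input_data)
prediction = np.asarray(new_X) @ state["params"]
```

The trade-off is that the full training data has to travel with the model, which may matter if the data set is large or sensitive.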

Note that if you are splitting this into client and server components, the server should do everything discussed above, and the client should just send it the input_data defined in your question. (The client doesn't ever need to see dmats or design_info.)

Matthias Fripp answered Sep 28 '22 13:09
