
Sklearn fit vs predict, order of columns matters?

Say X1 and X2 are two pandas DataFrames with the same columns, but possibly in a different order. Assume model is some sort of sklearn model, like LassoCV. Say I call model.fit(X1, y) and then model.predict(X2). Is the different column order a problem, or does the model save its weights by column name?

Also, same question, but what if X1 and X2 are numpy arrays?

Baron Yugovich asked Aug 02 '18 22:08



1 Answer

Yes, I believe it will matter. sklearn converts the pandas DataFrame to an array of values (essentially calling X1.values) and does not use the column names to match weights to features, so each weight is tied to a column position rather than a column name. However, it's an easy fix. Just use:

X2 = X2[X1.columns]

This re-orders X2's columns to match the order in X1.
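One caveat with that one-liner: it raises a KeyError if X2 is missing a column that X1 has, and it silently drops any extra columns in X2. A minimal sketch of a stricter version follows. The helper name align_columns is hypothetical (it is not part of pandas or sklearn), and the frames reuse the values from the example in this answer.

```python
import pandas as pd


def align_columns(X_new: pd.DataFrame, X_fit: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return X_new with its columns in X_fit's order.

    Raises ValueError if the two frames do not have the same set of columns.
    """
    missing = set(X_fit.columns) - set(X_new.columns)
    extra = set(X_new.columns) - set(X_fit.columns)
    if missing or extra:
        raise ValueError(
            f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    # Same selection as X2[X1.columns], but only after the check above.
    return X_new[X_fit.columns]


X1 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
X2 = pd.DataFrame({"b": [5, 4, 6], "a": [3, 2, 1]})

X2_aligned = align_columns(X2, X1)
print(list(X2_aligned.columns))
```

Whether to raise or to fill missing columns is a design choice; raising is used here because a silently mis-shaped input is the failure this question is about.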

The same is true of numpy arrays, of course, which carry no column names at all. The model is fit on the columns in the order they appear in X1, so when you predict on X2, it treats X2's columns positionally, as if they were in X1's order.

Example:

Take these 2 dataframes:

>>> X1
   a  b
0  1  5
1  2  6
2  3  7

>>> X2
   b  a
0  5  3
1  4  2
2  6  1

The model is fit on X1.values:

>>> X1.values
array([[1, 5],
       [2, 6],
       [3, 7]])

And you predict on X2.values:

>>> X2.values
array([[5, 3],
       [4, 2],
       [6, 1]])

There is no way for the model to know that the columns are switched. So switch them manually:

X2 = X2[X1.columns]

>>> X2
   a  b
0  3  5
1  2  4
2  1  6
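
To see the effect on actual predictions, here is a small sketch under a few assumptions: it uses LinearRegression instead of LassoCV to keep the fit exact, the data and the target y are made up for illustration, and it passes plain arrays via to_numpy() so that only column position is visible to the model. (Newer scikit-learn releases may warn or raise when a DataFrame's column names do not match the order seen during fit, so passing arrays keeps the demonstration version-independent.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Training frame with columns in the order a, b (illustrative data).
X1 = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 3.0, 8.0, 1.0]})
# Made-up target that is an exact linear function of the features.
y = 2.0 * X1["a"] - 1.0 * X1["b"]

# Fit on the raw array: the model only sees column positions.
model = LinearRegression().fit(X1.to_numpy(), y.to_numpy())

# Same rows, but with the columns swapped to b, a.
X2 = X1[["b", "a"]]

# Predicting on the swapped array feeds b into the weight learned for a.
pred_wrong = model.predict(X2.to_numpy())

# Re-ordering X2's columns to match X1 restores the correct predictions.
pred_right = model.predict(X2[X1.columns].to_numpy())

print("re-ordered matches y:", np.allclose(pred_right, y))
print("swapped matches y:   ", np.allclose(pred_wrong, y))
```

The swapped call runs without error, which is the point: nothing in a bare array tells the model that the columns moved.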
sacuL answered Sep 19 '22 12:09
