Say X1 and X2 are two pandas DataFrames with the same columns, but possibly in a different order. Assume model is some sort of sklearn model, like LassoCV. Say I do model.fit(X1, y), and then model.predict(X2). Is the fact that the columns are in a different order a problem, or does the model save weights by column name?
Also, same question, but what if X1 and X2 are numpy arrays?
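Yes, I believe it will matter. sklearn converts the pandas DataFrame to an array of values (essentially calling X1.values), and the fitted weights are tied to column position, not column name. Recent scikit-learn releases (1.0 and later) do record the column names seen during fit and complain at predict time if they differ or come in a different order, but they still will not reorder the columns for you. Either way, it's an easy fix. Just use: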
X2 = X2[X1.columns]
This reorders X2's columns to match the order in X1.
The same is true of numpy arrays, of course, and there you have no column names to fall back on. The model is fit on the columns in the order they appear in X1, so when you predict on X2 it assumes X2's columns are in that same order. You have to keep track of the order yourself.
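To make the positional behaviour concrete, here is a minimal sketch that uses a plain least-squares fit from numpy as a stand-in for an sklearn linear model. The data, the "true" weights (2 for a, -1 for b) and the variable names are all made up for illustration; the only point is that the same rows give different predictions once the columns are swapped.

```python
import numpy as np

# Training data with columns in the order (a, b). Values are illustrative.
X1 = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 8.0], [4.0, 7.0]])
y = 2.0 * X1[:, 0] - 1.0 * X1[:, 1]  # assumed relationship: y = 2*a - b

# Least-squares fit without an intercept; the weights are stored by position only.
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)

X2_ab = np.array([[3.0, 5.0], [2.0, 4.0]])  # columns in the fitted order (a, b)
X2_ba = X2_ab[:, ::-1]                      # same rows, columns swapped to (b, a)

pred_right = X2_ab @ weights  # uses a with the a-weight, b with the b-weight
pred_wrong = X2_ba @ weights  # silently applies the a-weight to b and vice versa

print(pred_right)
print(pred_wrong)
```

Nothing in the array itself says which column is which, so the swapped input produces a plausible-looking but wrong answer without any error.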
Example:
Take these 2 dataframes:
>>> X1
   a  b
0  1  5
1  2  6
2  3  7
>>> X2
   b  a
0  5  3
1  4  2
2  6  1
The model is fit on X1.values:
>>> X1.values
array([[1, 5],
       [2, 6],
       [3, 7]])
And you predict on X2.values:
>>> X2.values
array([[5, 3],
       [4, 2],
       [6, 1]])
There is no way for the model to know that the columns are switched. So switch them manually:
>>> X2 = X2[X1.columns]
>>> X2
   a  b
0  3  5
1  2  4
2  1  6
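If you do this in more than one place, it may be worth wrapping the reordering in a small helper that also checks that the two frames really have the same columns. The sketch below is only one way to do it; align_columns is a hypothetical name, not a pandas or sklearn function, and the frames are the ones from the example above.

```python
import pandas as pd

def align_columns(X_new, fit_columns):
    """Hypothetical helper: return X_new with columns in the order used at fit time."""
    fit_columns = list(fit_columns)
    missing = [c for c in fit_columns if c not in X_new.columns]
    extra = [c for c in X_new.columns if c not in fit_columns]
    if missing or extra:
        # Fail loudly instead of predicting on the wrong set of features.
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    return X_new[fit_columns]

X1 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})
X2 = pd.DataFrame({"b": [5, 4, 6], "a": [3, 2, 1]})

X2_aligned = align_columns(X2, X1.columns)
print(X2_aligned)
```

You would then call model.predict(X2_aligned) instead of model.predict(X2).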