I'm a newbie working on my first 'real' ML algorithm. Apologies if this is duplicated but I can't find the answer on SO.
I've got the following dataframe (df):
index  Feature1  Feature2  Feature3  Target
001    01        01        03        0
002    03        03        01        1
003    03        02        02        1
My code looks something like this:
data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.8)
clf = RandomForestClassifier().fit(X_train, y_train)
prediction_of_probability = clf.predict_proba(X_test)
What I'm struggling with is how I can get the 'prediction_of_probability' back into the dataframe df.
I understand the predictions would not be for all items in the original dataframe.
Thank you in advance for helping a newbie like me!
Note that predict_proba gives a probability for every class, not only the probability of class 1.
model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation belongs to each categorical label. The label with the highest probability is what model.predict() returns.
The predict_proba() method accepts a single argument corresponding to the data over which the probabilities are computed, and returns an array of lists containing the class probabilities for the input data points.
A Random Forest Classifier is an ensemble of Decision Trees. Each tree predicts a single class: that class gets probability 1 and the other classes get probability 0. The forest then votes among the trees, and predict_proba() returns the number of votes for each class divided by the number of trees in the forest.
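To make the shape of that output concrete, here is a minimal sketch on a tiny made-up dataset (the numbers are hypothetical, chosen only to show the structure):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny hypothetical dataset, just to inspect predict_proba's output
X = np.array([[1, 1, 3], [3, 3, 1], [3, 2, 2], [1, 2, 3]])
y = np.array([0, 1, 1, 0])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
proba = clf.predict_proba(X)

print(proba.shape)        # (n_samples, n_classes): one column per class
print(clf.classes_)       # tells you which column is which class
print(proba.sum(axis=1))  # each row sums to 1
```

The column order of the returned array follows clf.classes_, so for a 0/1 target, column 1 is the probability of class 1.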
What you did is train the model: with the features and labels you have, you fit a model for future data. To test the quality of the model (the choice of features, for example), it is evaluated on X_test and y_test. In this case you don't have future data, so you are not applying your model, only training and validating it. You can measure the quality of your model with ROC curves or the AUC score.
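As a sketch of that evaluation step (on synthetic data of my own making, not the question's df), roc_auc_score from sklearn.metrics takes the true test labels and the predicted probability of the positive class:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 features, target depends on the first two
rng = np.random.RandomState(42)
X = rng.rand(200, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# roc_auc_score wants the probability of the positive class (column 1)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(auc)
```

An AUC near 0.5 means the model is no better than chance; closer to 1.0 means it separates the classes well.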
Anyway you can append the results to the dataframe in this way:
df_test = pd.DataFrame(X_test)
df_test['Target'] = y_test
df_test['prob_0'] = prediction_of_probability[:, 0]  # probability of class 0
df_test['prob_1'] = prediction_of_probability[:, 1]  # probability of class 1
You can keep the indices of the train and test split and then put everything back together this way:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
indices = df.index.values
# use the indices instead of the labels to preserve the order of the split
X_train, X_test, indices_train, indices_test = train_test_split(data, indices, test_size=0.33, random_state=42)
y_train, y_test = labels[indices_train], labels[indices_test]
clf = RandomForestClassifier().fit(X_train, y_train)
prediction_of_probability = clf.predict_proba(X_test)
Then you can put the probabilities in the new df_new:
>>> df_new = df.copy()
>>> df_new.loc[indices_test, 'pred_test'] = prediction_of_probability[:, 1]  # probability of class 1
>>> print(df_new)
Feature1 Feature2 Feature3 Target pred_test
1 3 3 1 1 NaN
2 3 2 2 1 NaN
0 1 1 3 0 1.0
And even the predictions for the train:
>>> df_new.loc[indices_train, 'pred_train'] = clf.predict_proba(X_train)[:, 1]
>>> print(df_new)
Feature1 Feature2 Feature3 Target pred_test pred_train
1 3 3 1 1 NaN 1.0
2 3 2 2 1 NaN 1.0
0 1 1 3 0 1.0 NaN
Or, if you want to mix the probabilities of train and test, just use the same column name (i.e. pred).
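A minimal end-to-end sketch of that shared-column idea, using a made-up toy dataframe in place of the question's df (values and stratify choice are mine, added so the tiny split keeps both classes):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe standing in for the question's df
df = pd.DataFrame({'Feature1': [1, 3, 3, 2, 1, 2, 3, 1],
                   'Feature2': [1, 3, 2, 2, 3, 1, 3, 2],
                   'Feature3': [3, 1, 2, 1, 2, 3, 1, 3],
                   'Target':   [0, 1, 1, 1, 0, 0, 1, 0]})

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
indices = df.index.values

X_train, X_test, indices_train, indices_test = train_test_split(
    data, indices, test_size=0.33, random_state=42, stratify=labels)
y_train, y_test = labels[indices_train], labels[indices_test]

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# One shared 'pred' column: every row ends up with its probability of class 1
df_new = df.copy()
df_new.loc[indices_test, 'pred'] = clf.predict_proba(X_test)[:, 1]
df_new.loc[indices_train, 'pred'] = clf.predict_proba(X_train)[:, 1]
print(df_new)
```

With a single column there are no NaN gaps, but you lose the ability to tell train rows from test rows, so keep the indices around if that matters.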
You need something like this:
# Create new dataframe to store test data.
df1 = pd.DataFrame(X_test)
df1['Target'] = y_test
df1['prob'] = prediction_of_probability[:, 0]  # probability of class 0
# Create another dataframe to store train data
df2 = pd.DataFrame(X_train)
df2['Target'] = y_train
# Append both dataframes
df = pd.concat([df1, df2]).sort_index()  # DataFrame.append was removed in pandas 2.0