Add RandomForestClassifier Predict_Proba Results to Original Dataframe

I'm a newbie working on my first 'real' ML algorithm. Apologies if this is a duplicate, but I can't find the answer on SO.

I've got the following dataframe (df):

index    Feature1  Feature2  Feature3  Target
001       01         01        03        0
002       03         03        01        1
003       03         02        02        1

My code looks something like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.8)

clf = RandomForestClassifier().fit(X_train, y_train)

prediction_of_probability = clf.predict_proba(X_test)

What I'm struggling with is how to get 'prediction_of_probability' back into the dataframe df.

I understand the predictions would not be for all items in the original dataframe.

Thank you in advance for helping a newbie like me!

asked Feb 23 '18 by Python_Learner_DK

People also ask

What is the output of predict_proba?

predict_proba returns the predicted probability of each class for every sample; for a binary classifier, the second column is the probability of class 1.

What does model predict_proba () do in Sklearn?

model.predict_proba(): For classification problems, some estimators provide this method, which returns the probability that a new observation has each categorical label. The label with the highest probability is what model.predict() returns.

What does model predict_proba return?

The predict_proba() method accepts a single argument, the data for which probabilities should be computed, and returns an array containing the class probabilities for each input data point.

How does predict_proba work for Random Forest?

A Random Forest classifier is an ensemble of decision trees. Each tree predicts a single class (that class gets probability 1, the others 0), and the forest votes among the trees. predict_proba() returns the number of votes for each class divided by the number of trees in the forest.
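The shape of that output can be checked directly. A minimal sketch on hypothetical toy data (not the asker's dataframe): predict_proba returns one row per sample and one column per class, with each row summing to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: 4 samples, 3 features, 2 classes.
X = np.array([[1, 1, 3], [3, 3, 1], [3, 2, 2], [1, 2, 3]])
y = np.array([0, 1, 1, 0])

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(X)

print(proba.shape)        # one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1.0
```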


3 Answers

What you did is train the model: with the features and labels you have, you fit the model so it can predict future data. To assess the model's quality (the selection of features, for example), it is evaluated on X_test and y_test. In this case you don't have future data, so you are not applying your model, you are just training and evaluating it. You can measure the quality of your model with AUC or ROC curves.
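A hedged sketch of that quality check, using synthetic data rather than the asker's dataframe: roc_auc_score takes the probability of the positive class, not the hard predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem: class 1 when the first two features sum above 1.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# AUC is computed from the probability of class 1 (the second column).
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(round(auc, 3))
```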

Anyway you can append the results to the dataframe in this way:

df_test = pd.DataFrame(X_test)
df_test['Target'] = y_test
df_test['prob_0'] = prediction_of_probability[:,0] 
df_test['prob_1'] = prediction_of_probability[:,1]
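Because train_test_split on a DataFrame preserves the original index, the test-set probabilities can also be written straight back into the original df, leaving NaN for the training rows. A sketch with hypothetical data (the column name 'prob_1' and the stratify argument are my additions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Feature1': [1, 3, 3, 2, 1, 3],
                   'Feature2': [1, 3, 2, 2, 3, 1],
                   'Feature3': [3, 1, 2, 1, 2, 3],
                   'Target':   [0, 1, 1, 1, 0, 0]})

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
# stratify keeps both classes in the training split on this tiny example.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.5, random_state=0, stratify=labels)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# X_test.index identifies the original rows; untested rows stay NaN.
df.loc[X_test.index, 'prob_1'] = proba[:, 1]
print(df)
```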
answered Oct 18 '22 by Joe


You can try to keep the indices of the train and test and then put it all together this way:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = df[['Feature1', 'Feature2', 'Feature3']]
labels = df['Target']
indices = df.index.values 

# Use the indices instead of the labels to preserve the order of the split.
X_train, X_test, indices_train, indices_test = train_test_split(
    data, indices, test_size=0.33, random_state=42)

y_train, y_test = labels[indices_train], labels[indices_test]


clf = RandomForestClassifier().fit(X_train, y_train)

prediction_of_probability = clf.predict_proba(X_test)

Then you can put the probabilities in the new df_new:

>>> df_new = df.copy()
>>> df_new.loc[indices_test, 'pred_test'] = prediction_of_probability[:, 1]  # probability of class 1
>>> print(df_new)

   Feature1  Feature2  Feature3  Target  pred_test
1         3         3         1       1        NaN
2         3         2         2       1        NaN
0         1         1         3       0        1.0

And even the predictions for the train:

>>> df_new.loc[indices_train, 'pred_train'] = clf.predict_proba(X_train)[:, 1]
>>> print(df_new)

   Feature1  Feature2  Feature3  Target  pred_test  pred_train
1         3         3         1       1        NaN         1.0
2         3         2         2       1        NaN         1.0
0         1         1         3       0        1.0         NaN

Or, if you want to mix the probabilities of train and test, just use the same column name (e.g. pred).

answered Oct 18 '22 by Mabel Villalba


You need something like this:

# Create a new dataframe to store the test data.
df1 = pd.DataFrame(X_test)
df1['Target'] = y_test
df1['prob'] = prediction_of_probability[:, 0]

# Create another dataframe to store the train data.
df2 = pd.DataFrame(X_train)
df2['Target'] = y_train

# Concatenate both dataframes (DataFrame.append was removed in pandas 2.0).
df = pd.concat([df1, df2]).sort_index()
answered Oct 18 '22 by Sociopath