
How to get original values after using factorize() in Python?

I'm a beginner trying to build a predictive model with Random Forest in Python, using train and test datasets. train["ALLOW/BLOCK"] takes one of four expected values (all strings). test["ALLOW/BLOCK"] is what needs to be predicted.

y,_ = pd.factorize(train["ALLOW/BLOCK"])

y
Out[293]: array([0, 1, 0, ..., 1, 0, 2], dtype=int64)

I then called predict to get the predictions.

clf.predict(test[features])

clf.predict(test[features])[0:10]
Out[294]: array([0, 0, 0, 0, 0, 2, 2, 0, 0, 0], dtype=int64)

How can I get the original string values instead of the numeric codes? And does the following code correctly compare the actual and predicted values?

z,_= pd.factorize(test["AUDIT/BLOCK"])

z==clf.predict(test[features])
Out[296]: array([ True, False, False, ..., False, False, False], dtype=bool) 
Parvathy Sarat asked Sep 09 '17 19:09



1 Answer

First, save the labels (the array of unique values) that pd.factorize returns as its second value:

y, label = pd.factorize(train["ALLOW/BLOCK"])

Then, once you have the numeric predictions, you can look up the corresponding original values by indexing with label[pred]:

pred = clf.predict(test[features])
pred_label = label[pred]

pred_label now contains the predictions as the original string values.
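As a minimal, self-contained sketch of the lookup above: the label values and the pred array here are made up for illustration (they stand in for train["ALLOW/BLOCK"] and clf.predict(test[features])), so only the mechanics carry over to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical training labels, standing in for train["ALLOW/BLOCK"].
train_labels = pd.Series(["ALLOW", "BLOCK", "ALLOW", "AUDIT", "BLOCK"])

# factorize returns the integer codes and the unique values,
# in order of first appearance.
y, label = pd.factorize(train_labels)

# Hypothetical numeric predictions, standing in for clf.predict(test[features]).
pred = np.array([0, 2, 1, 0])

# Indexing the unique values with the codes recovers the original strings.
pred_label = label[pred]
print(list(pred_label))
```

Because label is ordered by first appearance in the training column, code 0 maps back to "ALLOW", 1 to "BLOCK" and 2 to "AUDIT" in this made-up data.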


No, you should not factorize the test column separately, since the mapping from values to codes will very likely differ from the one built on the training data. Consider the following example:

pd.factorize(['a', 'b', 'c'])
# (array([0, 1, 2]), array(['a', 'b', 'c'], dtype=object))

pd.factorize(['c', 'a', 'b'])
# (array([0, 1, 2]), array(['c', 'a', 'b'], dtype=object))

So the code assigned to each value depends on the order in which the values first appear.
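One way to compare actual and predicted values safely is to encode the test column with the training labels instead of factorizing it again. The sketch below uses pd.Categorical with the training labels as its categories; the data and the pred array are hypothetical, and this is one possible approach rather than the only one (comparing label[pred] against the raw test strings works too).

```python
import numpy as np
import pandas as pd

# Hypothetical data, standing in for the train and test label columns.
train_labels = pd.Series(["ALLOW", "BLOCK", "ALLOW", "AUDIT"])
test_labels = pd.Series(["BLOCK", "ALLOW", "AUDIT", "BLOCK"])

# Mapping learned from the training data: ALLOW -> 0, BLOCK -> 1, AUDIT -> 2.
_, label = pd.factorize(train_labels)

# Factorizing the test column separately yields a different mapping
# (BLOCK -> 0, ALLOW -> 1, AUDIT -> 2), so these codes are not comparable.
test_codes_wrong, _ = pd.factorize(test_labels)

# Encode the test column with the training mapping instead.
# Values that never occurred in training would get the code -1.
test_codes = pd.Categorical(test_labels, categories=label).codes

# Hypothetical predictions, standing in for clf.predict(test[features]).
pred = np.array([1, 0, 0, 1])

matches = test_codes == pred
accuracy = matches.mean()
print(list(matches), accuracy)
```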

Psidom answered Oct 18 '22 20:10