Multiple output regression or classifier with one (or more) parameters with Python

I wrote simple linear regression and decision tree classifier code with Python's scikit-learn library to predict the outcome. It works well.

My question is: is there a way to do this in reverse, i.e. to predict the best combination of parameter values based on a given outcome (the parameters for which the accuracy will be best)?

Or, to put it another way: is there a classification, regression, or some other type of algorithm (decision tree, SVM, KNN, logistic regression, linear regression, polynomial regression, ...) that can predict multiple outcomes based on one or more parameters?

I have tried to do this by passing a multivariate outcome, but it raises this error:

ValueError: Expected 2D array, got 1D array instead: array=[101 905 182 268 646 624 465]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This is the code that I wrote for regression:

import pandas as pd
from sklearn import linear_model
from sklearn import tree

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:, :-1]  # features: par_1, par_2
results = df.iloc[:, -1]     # target: outcome

regression = linear_model.LinearRegression()
regression.fit(variables, results)

input_values = [14, 2]

# predict the outcome for a new (par_1, par_2) pair
prediction = regression.predict([input_values])
prediction = round(prediction[0], 2)
print(prediction)

This is the code that I wrote for the decision tree:

dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes']}

df = pd.DataFrame(dic)

variables = df.iloc[:, :-1]  # features: par_1, par_2
results = df.iloc[:, -1]     # target: outcome (categorical)

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(variables, results)

input_values = [18, 2]

# predict the class for a new (par_1, par_2) pair
prediction = decision_tree.predict([input_values])[0]
print(prediction)
asked Jun 08 '19 by taga

2 Answers

As mentioned by @Justas, if you want to find the best combination of input values for which the output variable is at its max/min, then it is an optimization problem.

There is quite a good range of non-linear optimizers available in scipy, or you can go for meta-heuristics such as genetic algorithms, memetic algorithms, etc.
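For illustration, here is a minimal sketch (not part of the original answer) of that optimization framing: it refits the question's LinearRegression model and uses scipy.optimize.minimize to search for the (par_1, par_2) pair whose predicted outcome is closest to a desired value. The target outcome, starting point, and bounds are assumptions chosen to match the training data:

import numpy as np
import pandas as pd
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

# refit the regression model from the question
df = pd.DataFrame({'par_1': [10, 30, 13, 19, 25, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'outcome': [101, 905, 182, 268, 646, 624, 465]})
model = LinearRegression().fit(df[['par_1', 'par_2']].values,
                               df['outcome'].values)

desired_outcome = 400  # assumed target value

def cost(x):
    # squared distance between the predicted and the desired outcome
    return (model.predict(x.reshape(1, -1))[0] - desired_outcome) ** 2

# bounds roughly match the ranges of par_1 and par_2 in the training data
result = minimize(cost, x0=np.array([20.0, 2.0]),
                  method='L-BFGS-B', bounds=[(10, 33), (1, 3)])
print(result.x)  # input combination whose prediction is closest to 400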

On the other hand, if your aim is to learn the inverse function, which maps the output variable to a set of input variables, then go for MultiOutputRegressor or MultiOutputClassifier. Both of them can be used as a wrapper on top of any base estimator, such as LinearRegression, LogisticRegression, KNN, DecisionTree, SVM, etc.

Example:

import pandas as pd
from sklearn.multioutput import MultiOutputRegressor, RegressorChain
from sklearn.linear_model import LinearRegression


dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_reg = MultiOutputRegressor(LinearRegression())
# learn the inverse mapping: outcome -> (par_1, par_2)
multi_output_reg.fit(results.values.reshape(-1, 1), variables)

multi_output_reg.predict([[100]])

# array([[12.43124217,  1.12571947]])
# sounds sensible according to the training data

# if the input variables need to be treated as categories,
# go for MultiOutputClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
multi_output_clf.fit(results.values.reshape(-1, 1), variables)

multi_output_clf.predict([[100]])

# array([[10,  1]])

In most situations, knowing the value of one of the input variables can help in predicting the other variables. This approach can be achieved with ClassifierChain or RegressorChain.

To understand the advantage of ClassifierChain, please refer to this example.
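As a rough sketch (not from the original answer), a RegressorChain on the question's data could look like the following; the chain order is an assumption, and reversing it models the opposite dependency:

import pandas as pd
from sklearn.multioutput import RegressorChain
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'par_1': [10, 30, 13, 19, 25, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'outcome': [101, 905, 182, 268, 646, 624, 465]})

X = df[['outcome']].values          # single input: the outcome
Y = df[['par_1', 'par_2']].values   # two targets: the parameters

# predict par_1 first, then use it as an extra feature when predicting par_2
chain = RegressorChain(LinearRegression(), order=[0, 1])
chain.fit(X, Y)
print(chain.predict([[100]]))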

Update:


dic = {'par_1': [10, 30, 13, 19, 25, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'outcome': [0, 1, 1, 1, 1, 1, 0]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

multi_output_clf = MultiOutputClassifier(LogisticRegression(solver='lbfgs',
                                                            multi_class='ovr'))
multi_output_clf.fit(results.values.reshape(-1, 1), variables)

multi_output_clf.predict([[1]])
# array([[13,  3]])

answered Oct 12 '22 by Venkatachalam


You could frame the problem as an optimization problem.

Treat the (trained) regression model's input values as the parameters to be searched.

Define the cost function as the distance between the model's predicted outcome (at a given input combination) and the desired outcome (the value you want).

Then use one of the global optimization algorithms (e.g. genetic optimization) to find the input combination that minimizes the cost (i.e. the predicted outcome is closest to your desired outcome).
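As a minimal sketch of this approach (the desired outcome value and the search bounds are assumptions), scipy's differential_evolution, an evolutionary global optimizer, can do the search:

import pandas as pd
from scipy.optimize import differential_evolution
from sklearn.linear_model import LinearRegression

# train the regression model from the question
df = pd.DataFrame({'par_1': [10, 30, 13, 19, 25, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'outcome': [101, 905, 182, 268, 646, 624, 465]})
model = LinearRegression().fit(df[['par_1', 'par_2']].values,
                               df['outcome'].values)

desired = 500  # the outcome you want (assumed value)

def cost(x):
    # distance between the predicted outcome and the desired one
    return abs(model.predict(x.reshape(1, -1))[0] - desired)

# search bounds for par_1 and par_2, assumed from the training ranges
result = differential_evolution(cost, bounds=[(10, 33), (1, 3)], seed=0)
print(result.x, result.fun)  # best inputs and their remaining distance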

answered Oct 13 '22 by Justas