How to get scikit learn to find simple non-linear relationship

I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit-learn I wanted to see if I could predict ZR from the other columns (which should be possible, as I made it from R and Z). My steps have been:
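For reference, a setup like the one described can be reproduced with synthetic data (the values here are hypothetical; only the column names and the Z / R construction come from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: random positive columns,
# with ZR constructed as Z divided by R, as described above.
rng = np.random.default_rng(0)
results = pd.DataFrame(
    rng.uniform(1.0, 10.0, size=(100, 5)),
    columns=['R', 'T', 'V', 'X', 'Z'],
)
results['ZR'] = results['Z'] / results['R']
```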

import numpy as np
from sklearn import preprocessing, linear_model

columns = ['R', 'T', 'V', 'X', 'Z']
for c in columns:
    results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
# print(labels)
# print(features)
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print(np.mean((regr.predict(features) - labels) ** 2))

This gives

[ 0.36472515 -0.79579885 -0.16316067  0.67995378  0.59256197]
0.458552051342
  1. The preprocessing seems wrong, as I think it destroys the Z/R relationship. What's the right way to preprocess in this situation?
  2. Is there some way to get near 100% accuracy? Linear regression is the wrong tool, as the relationship is non-linear.
  3. The five features are highly correlated in my data. Is non-negative least squares implemented in scikit-learn? (I can see it mentioned in the mailing list but not the docs.) My aim would be to get as many coefficients set to zero as possible.
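On point 3, scikit-learn's `Lasso` does not enforce non-negativity, but its L1 penalty does tend to set coefficients of redundant features exactly to zero, which matches the stated aim. A minimal sketch with synthetic correlated features (not the asker's data):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Five highly correlated features: copies of one signal plus small noise.
X = base + 0.01 * rng.normal(size=(200, 5))
y = 2.0 * X[:, 0]

model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_)  # with near-duplicate features, some coefficients are exactly zero
```

The `alpha` parameter controls how aggressively coefficients are shrunk toward zero; larger values produce sparser models.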
asked Mar 07 '14 by graffe


1 Answer

You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)

You can play with the parameters to get better performance.
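Putting the answer together end to end with synthetic data in the shape described (the data itself is made up; only the ZR = Z / R construction and the model parameters come from the thread):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
data = rng.uniform(1.0, 10.0, size=(500, 5))   # stand-ins for R, T, V, X, Z
labels = data[:, 4] / data[:, 0]               # ZR = Z / R

model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(data, labels)
print(model.score(data, labels))  # R^2 on the training data
```

Note that scoring on the training data overstates performance; a train/test split (e.g. `sklearn.model_selection.train_test_split`) gives a more honest estimate.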

answered Sep 22 '22 by cfh