
OLS Regression: Scikit vs. Statsmodels? [closed]

Short version: I was using the scikit LinearRegression on some data, but I'm used to p-values, so I put the data into the statsmodels OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely problem is that I've made an error somewhere, and now I don't feel confident in either output (I have probably built one model incorrectly but don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

For scikit's LR, the results are (statistically) the same whether I set normalize=True or normalize=False, which I find somewhat strange.

For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away but the results are exactly the same.

Granted, I'm using 5-fold CV for the sklearn approach (R^2 is consistent for both test and training data each time), and for statsmodels I just throw all the data at it.

R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.

Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, rank order of "when seen" is the same although the loadings are very different, and rank order for the character class dummies is the same although again the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.

I would love to know:

  1. Which output might be accurate? (Granted they might both be if I missed a kwarg.)
  2. If I made a mistake, what is it and how to fix it?
  3. Could I have figured this out without asking here, and if so how?

I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)

Thanks!

Nat Poor asked Feb 26 '14 22:02



2 Answers

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Generate artificial data (2 regressors + constant)
    nobs = 100
    X = np.random.random((nobs, 2))
    X = sm.add_constant(X)  # prepend a column of ones
    beta = [1, .1, .5]
    e = np.random.random(nobs)
    y = np.dot(X, beta) + e

    # Fit regression model
    sm.OLS(y, X).fit().params
    # >> array([ 1.4507724 ,  0.08612654,  0.60129898])

    # X already contains the constant, so turn off sklearn's own intercept
    LinearRegression(fit_intercept=False).fit(X, y).coef_
    # >> array([ 1.4507724 ,  0.08612654,  0.60129898])
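
One way the two X matrices can end up different, consistent with the question's description of running StandardScaler only on the statsmodels side, is column scaling. The following is a minimal numpy-only sketch under that assumption, not a reconstruction of the asker's actual pipeline. The data, the seed, and the helper `ols_with_intercept` are all made up for illustration. It shows that standardizing the regressors leaves R^2 unchanged while rescaling every slope by that column's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two regressors on very different scales.
n = 200
X = np.column_stack([rng.normal(0, 1, n), rng.normal(50, 10, n)])
y = 2.0 + 0.5 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0, 0.1, n)


def ols_with_intercept(X, y):
    """Return (coefficients incl. intercept, R^2) from a least-squares fit."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2


# Fit on the raw columns, then on columns standardized the way
# StandardScaler does it (subtract the mean, divide by the population std).
beta_raw, r2_raw = ols_with_intercept(X, y)
sd = X.std(axis=0)
Z = (X - X.mean(axis=0)) / sd
beta_std, r2_std = ols_with_intercept(Z, y)

print(beta_raw[1:], beta_std[1:])  # slopes differ between the two fits
print(r2_raw, r2_std)              # R^2 is the same for both
```

If this is what happened, neither output is wrong: each standardized slope is the raw slope multiplied by that column's standard deviation, so the two sets of coefficients describe the same fit in different units.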

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (e.g. dropping different columns).
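
A quick way to check for this is to compare the number of columns in X with its numerical rank. The sketch below uses a made-up three-level categorical column (the labels are hypothetical, loosely modelled on the character-class dummies in the question) and shows that an intercept plus a dummy for every level is rank deficient, while dropping one dummy restores full column rank.

```python
import numpy as np

# Hypothetical categorical column with three levels.
labels = np.array(
    ["mage", "rogue", "priest", "mage", "rogue", "priest", "mage", "rogue"]
)
levels = np.unique(labels)

# One indicator column per level (no level dropped).
dummies = (labels[:, None] == levels[None, :]).astype(float)
const = np.ones((len(labels), 1))

X_all = np.column_stack([const, dummies])          # intercept + all 3 dummies
X_drop = np.column_stack([const, dummies[:, 1:]])  # intercept + 2 dummies

rank_all = np.linalg.matrix_rank(X_all)
rank_drop = np.linalg.matrix_rank(X_drop)

print(X_all.shape[1], rank_all)    # 4 columns, rank 3: not full column rank
print(X_drop.shape[1], rank_drop)  # 3 columns, rank 3: full column rank
```

The same check can be run on the real design matrix before it is passed to either library. If the rank is lower than the column count, the individual coefficients are not uniquely determined, even though the fitted values and R^2 are.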

I recommend you use pandas and patsy to take care of this:

    import pandas as pd
    from patsy import dmatrices

    dat = pd.read_csv('wow.csv')
    y, X = dmatrices('levels ~ week + character + guild', data=dat)

Or, alternatively, the statsmodels formula interface:

    import pandas as pd
    import statsmodels.formula.api as smf

    dat = pd.read_csv('wow.csv')
    mod = smf.ols('levels ~ week + character + guild', data=dat).fit()

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html

Vincent answered Oct 04 '22 09:10


If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same old result from OLS using the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

    import statsmodels.formula.api as smf

    # df is a pandas DataFrame with columns 'y' and 'x'
    smod = smf.ols(formula='y ~ x', data=df)
    result = smod.fit()
    print(result.summary())

When in doubt, please

  1. try reading the source code
  2. try a different language as a benchmark, or
  3. try OLS from scratch, which is basic linear algebra.
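
As a rough illustration of the third suggestion, here is a minimal sketch of OLS from scratch with numpy. The data and seed are made up for the example. It solves the normal equations directly and cross-checks the result against numpy's least-squares solver, which is generally the better-conditioned route.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a constant plus two regressors.
nobs = 100
X = np.column_stack([np.ones(nobs), rng.random((nobs, 2))])
y = X @ np.array([1.0, 0.1, 0.5]) + rng.normal(0, 0.1, nobs)

# Normal equations: solve (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same fit through a least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Feeding the same X and y to a library and comparing its coefficients against these is a simple way to find out which of two disagreeing outputs was built from the matrix you intended.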

Sarah answered Oct 04 '22 11:10