OLS Regression: Scikit vs. Statsmodels? [closed]

Short version: I was using scikit's LinearRegression on some data, but I'm used to p-values, so I put the data into statsmodels' OLS. Although the R^2 is about the same, the variable coefficients all differ by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (I have probably built one model incorrectly but don't know which one).

Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.

I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.

For scikit's LR, the results are (statistically) the same whether I set normalize=True or normalize=False, which I find somewhat strange.

For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709 That makes the warning go away but the results are exactly the same.

Granted, I'm using 5-fold CV for the sklearn approach (R^2 values are consistent for both test and training data each time), and for statsmodels I just throw all the data at it.

R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.

The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.

Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.

The dependent variable is how many levels each character gained during that week (int).

Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, rank order of "when seen" is the same although the loadings are very different, and rank order for the character class dummies is the same although again the loadings are very different.

I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm

I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.

I would love to know:

Which output might be accurate? (Granted they might both be if I missed a kwarg.)
If I made a mistake, what is it and how to fix it?
Could I have figured this out without asking here, and if so how?

I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.

(I even tried some other OLS calls to triangulate: one gave a much lower R^2, one looped for five minutes until I killed it, and one crashed.)

Thanks!

asked Feb 26 '14 22:02

Nat Poor

2 Answers

It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.

    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LinearRegression

    # Generate artificial data (2 regressors + constant)
    nobs = 100
    X = np.random.random((nobs, 2))
    X = sm.add_constant(X)  # prepend a column of ones
    beta = [1, .1, .5]
    e = np.random.random(nobs)
    y = np.dot(X, beta) + e

    # Fit regression model
    sm.OLS(y, X).fit().params
    # >> array([ 1.4507724 ,  0.08612654,  0.60129898])

    # X already contains the constant, so sklearn must not add its own intercept
    LinearRegression(fit_intercept=False).fit(X, y).coef_
    # >> array([ 1.4507724 ,  0.08612654,  0.60129898])

As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (e.g. dropping different columns).
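A quick way to check the rank concern is to compare the number of columns in X with its numerical rank. The sketch below uses made-up data, not the WoW data set: it assumes a hypothetical design with a constant plus a full set of three class dummies, which is the usual way a design matrix ends up rank deficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical categorical variable with 3 levels (illustrative only)
classes = rng.integers(0, 3, size=n)
classes[:3] = [0, 1, 2]            # make sure every level appears at least once
dummies = np.eye(3)[classes]       # one 0/1 column per level

# Constant + ALL dummies: the dummy columns sum to the constant column
X_bad = np.column_stack([np.ones(n), dummies])
# Constant + all but one dummy: the usual fix
X_ok = np.column_stack([np.ones(n), dummies[:, 1:]])

rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns:", X_bad.shape[1], "rank:", rank_bad)  # rank below column count
print("columns:", X_ok.shape[1], "rank:", rank_ok)    # rank equals column count
```

If the rank is smaller than the number of columns, the coefficients are not uniquely determined, so two libraries can legitimately report different coefficients with the same fit.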

I recommend you use pandas and patsy to take care of this:

    import pandas as pd
    from patsy import dmatrices

    dat = pd.read_csv('wow.csv')
    y, X = dmatrices('levels ~ week + character + guild', data=dat)

Or, alternatively, the statsmodels formula interface:

    import pandas as pd
    import statsmodels.formula.api as smf

    dat = pd.read_csv('wow.csv')
    mod = smf.ols('levels ~ week + character + guild', data=dat).fit()

Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html
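One more thing that may be worth ruling out, since the question standardizes the data only on the statsmodels side: rescaling the regressors changes the slope coefficients but leaves R^2 untouched when an intercept is included. The sketch below uses only NumPy and made-up data; the helper `ols_with_intercept` is a hypothetical name introduced for illustration, and the standardization mimics what StandardScaler does (population standard deviation) rather than calling it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up regressors on very different scales (illustrative only)
X_raw = rng.random((n, 2)) * np.array([1.0, 50.0])
y = 1.0 + 0.1 * X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(size=n)

def ols_with_intercept(X, y):
    """Hypothetical helper: least-squares fit with a constant, returns (coef, R^2)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, r2

# Standardize each column: subtract the mean, divide by the population std
std = X_raw.std(axis=0)
X_std = (X_raw - X_raw.mean(axis=0)) / std

coef_raw, r2_raw = ols_with_intercept(X_raw, y)
coef_std, r2_std = ols_with_intercept(X_std, y)

print("R^2 raw vs. standardized:", r2_raw, r2_std)
print("slopes raw:         ", coef_raw[1:])
print("slopes standardized:", coef_std[1:])
print("raw slopes * std:   ", coef_raw[1:] * std)
```

Under these assumptions the standardized slopes equal the raw slopes multiplied by each column's standard deviation, which would produce the pattern described in the question: matching R^2, very different coefficients, and the same sign on each coefficient.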

answered Oct 04 '22 09:10

Vincent

If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same old result from OLS using the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.

    import statsmodels.formula.api as smf

    # df is a pandas DataFrame with columns 'y' and 'x'
    smod = smf.ols(formula='y ~ x', data=df)
    result = smod.fit()
    print(result.summary())

When in doubt, please

try reading the source code
try a different language for benchmark, or
try OLS from scratch, which is basic linear algebra.
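As a rough illustration of the last suggestion, here is a minimal from-scratch OLS in NumPy. It is a sketch on made-up data (the coefficients and noise level are arbitrary assumptions), showing the normal equations next to `np.linalg.lstsq`, which is numerically safer when the columns of X are nearly collinear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Made-up design matrix: a constant column plus two random regressors
X = np.column_stack([np.ones(n), rng.random((n, 2))])
beta_true = np.array([1.0, 0.1, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) beta = X'y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver: same answer for a full-rank X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# R^2 from the residuals
resid = y - X @ beta_lstsq
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print(beta_normal)
print(beta_lstsq)
print(r2)
```

Feeding the same X and y to this and to each library gives a third set of coefficients to compare against, which helps identify which call is receiving a different design matrix.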

answered Oct 04 '22 11:10

Sarah

Tags:

python

scikit-learn

linear-regression

statsmodels
