I have a dataframe that has about 370 columns. I'm testing a series of hypotheses that require me to use subsets of the columns to fit a cubic regression model. I'm planning on using statsmodels to model this data.
Part of the process for polynomial regression involves mean centering variables (subtracting the mean from every case for a particular feature).
I can do this with 3 lines of code, but it seems inefficient given that I need to repeat the process for half a dozen hypotheses. Keep in mind that I need coefficient-level output from statsmodels, so I need to retain the column names.
Here's a peek at the data. It's the subset of columns I need for one of my hypothesis tests.
      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53
Here is the code that mean centers and keeps the column names.
import pandas as pd
from sklearn import preprocessing
#create df of features for hypothesis, from full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]
#center the variables; pass real booleans, not strings ('False' is a non-empty string and therefore truthy)
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)
#convert back into a Pandas dataframe and add column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)
Any recommendations on how to make this more efficient / faster would be awesome!
df.apply(lambda x: x-x.mean())
%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop
df.subtract(df.mean())
%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 µs per loop
both yielding:
        i  we  you  shehe  they  ipron
0  0.0725   0    0  0.195 -0.18 -0.145
1  0.8025   0    0 -0.065 -0.18  0.495
2 -0.4375   0    0 -0.065  0.54  0.285
3 -0.4375   0    0 -0.065 -0.18 -0.635
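Since the question mentions repeating this for half a dozen hypotheses, the subtraction can be wrapped in a small helper. This is only a sketch: `center_columns`, the `hypotheses` dict, and the `h3` column list are made-up names for illustration, and the frame is rebuilt from the four sample rows shown in the question.

```python
import pandas as pd

def center_columns(df, columns):
    """Return a mean-centered copy of the selected columns (names and index are kept)."""
    subset = df[columns]
    # DataFrame minus Series aligns on column labels, so each column loses its own mean
    return subset - subset.mean()

# The four sample rows from the question
df = pd.DataFrame({
    "i": [0.51, 1.24, 0.00, 0.00],
    "we": [0, 0, 0, 0],
    "you": [0, 0, 0, 0],
    "shehe": [0.26, 0.00, 0.00, 0.00],
    "they": [0.00, 0.00, 0.72, 0.00],
    "ipron": [1.02, 1.66, 1.45, 0.53],
})

# Hypothetical mapping: hypothesis name -> feature columns it needs
hypotheses = {
    "h2": ["i", "we", "you", "shehe", "they", "ipron"],
    "h3": ["i", "ipron"],
}

# One centered DataFrame per hypothesis
centered = {name: center_columns(df, cols) for name, cols in hypotheses.items()}
print(centered["h2"].round(4))
```

Because the result is still a DataFrame with the original column names, it can be handed to statsmodels without re-labelling.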
I know this question is a little old, but in my timings scikit-learn is now the fastest solution. Plus, you can condense the code into one line:
pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),columns = df.columns)
%timeit pd.DataFrame(preprocessing.scale(df, with_mean=True, with_std=False),columns = df.columns)
684 µs ± 30.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df.subtract(df.mean())
%timeit df.subtract(df.mean())
1.63 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
df I used for testing:
df = pd.DataFrame(np.random.randint(low=1, high=10, size=(20,5)),columns = list('abcde'))
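The timings above come from a small 20x5 frame, and the ranking may change with size, so it is probably worth re-running the comparison on data shaped like your own. Below is a rough sketch using the standard `timeit` module. The 1000x370 shape is a guess (only the column count comes from the question), the function names are made up, and the scikit-learn call is left out so the snippet needs only pandas and NumPy; a plain NumPy subtraction stands in as a third variant.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so repeated runs use the same data
# Assumed shape: 1000 rows x 370 columns (the row count is a guess)
df = pd.DataFrame(rng.random((1000, 370)), columns=[f"c{i}" for i in range(370)])

def via_apply(d):
    return d.apply(lambda x: x - x.mean())

def via_subtract(d):
    return d.subtract(d.mean())

def via_numpy(d):
    values = d.to_numpy()
    # Subtract the column means on the raw array, then restore labels
    return pd.DataFrame(values - values.mean(axis=0), columns=d.columns, index=d.index)

variants = (via_apply, via_subtract, via_numpy)

# Check that all variants agree before comparing their speed
results = [f(df) for f in variants]
same = all(np.allclose(results[0], r) for r in results[1:])
print("identical results:", same)

for f in variants:
    seconds = timeit.timeit(lambda: f(df), number=20) / 20
    print(f"{f.__name__}: {seconds * 1e3:.3f} ms per call")
```

The absolute numbers depend on the machine and library versions, so treat them as relative only.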