Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simple linear regression using pandas dataframe

I'm looking to check trends for a number of entities (SysNr)

I have data spanning 3 years (2014,2015,2016)

I'm looking at a large quantity of variables, but will limit this question to one ('res_f_r')

My DataFrame looks something like this

d = [
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 2, 'res_f_r': 300000},
    {'RegnskabsAar': 2016, 'SysNr': 2, 'res_f_r': 250000},
]

df = pd.DataFrame(d)



   RegnskabsAar  SysNr  res_f_r
0          2014      1   350000
1          2015      1   400000
2          2016      1   450000
3          2014      2   350000
4          2015      2   300000
5          2016      2   250000

My desire is to do a linear regression on each entity (SysNr) and get returned the slope and intercept

My desired output for the above is

   SysNr  intercept  slope
0      1     300000  50000
1      2     400000 -50000

Any ideas?

like image 941
Henrik Poulsen Avatar asked Feb 13 '18 13:02

Henrik Poulsen


1 Answers

So I don't know why our intercept values differ (maybe I have made a mistake or your given data is not the full data you expect to work on), but I'd suggest you to use np.polyfit or the tool of your choice (scikit-learn, scipy.stats.linregress, ...) in combination with groupby and apply:

In [25]: df.groupby("SysNr").apply(lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1))
Out[25]:
SysNr
1    [49999.99999999048, -100349999.99998075]
2    [-49999.99999999045, 101049999.99998072]
dtype: object

After that, beautify it:

In [43]: df.groupby("SysNr").apply(
    ...:     lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1)).apply(
    ...:     pd.Series).rename(columns={0:'slope', 1:'intercept'}).reset_index()
Out[43]:
   SysNr    slope     intercept
0      1  50000.0 -1.003500e+08
1      2 -50000.0  1.010500e+08

Edit:

Because you asked on the other answer in the comment about how to handle missing years for some SysNr: Just drop that NaNs for a valid linear regression. Of course you could also fill them with the mean or so, depending on what you want to achieve, but that isn't that helpful from my point of view.

If the entity has only data for one year, you can't apply a linear regression on that usefully. But you can (if you want and that fits your case, please provide more information on the data if needed) extrapolate somehow the slope of the other entities to this one and calculate the intercept. For that of course you must make some assumptions on the distribution of the slope of the entities (e.g. linear, then the slope of sysNr 3 would be -150000.0).

like image 196
Nico Albers Avatar answered Oct 12 '22 23:10

Nico Albers