I'm looking to check trends for a number of entities (SysNr).
I have data spanning 3 years (2014, 2015, 2016).
I'm looking at a large number of variables, but will limit this question to one ('res_f_r').
My DataFrame looks something like this:
import pandas as pd

d = [
{'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
{'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
{'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
{'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
{'RegnskabsAar': 2015, 'SysNr': 2, 'res_f_r': 300000},
{'RegnskabsAar': 2016, 'SysNr': 2, 'res_f_r': 250000},
]
df = pd.DataFrame(d)
   RegnskabsAar  SysNr  res_f_r
0          2014      1   350000
1          2015      1   400000
2          2016      1   450000
3          2014      2   350000
4          2015      2   300000
5          2016      2   250000
I'd like to run a linear regression for each entity (SysNr) and get back the slope and intercept.
My desired output for the above is:
   SysNr  intercept   slope
0      1     300000   50000
1      2     400000  -50000
Any ideas?
I don't know why our intercept values differ (maybe I made a mistake, or the data you posted is not the full data you expect to work on), but I'd suggest using np.polyfit, or the tool of your choice (scikit-learn, scipy.stats.linregress, ...), in combination with groupby and apply:
In [25]: df.groupby("SysNr").apply(lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1))
Out[25]:
SysNr
1 [49999.99999999048, -100349999.99998075]
2 [-49999.99999999045, 101049999.99998072]
dtype: object
After that, beautify it:
In [43]: df.groupby("SysNr").apply(
    ...:     lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1)).apply(
    ...:     pd.Series).rename(columns={0:'slope', 1:'intercept'}).reset_index()
Out[43]:
   SysNr    slope     intercept
0      1  50000.0 -1.003500e+08
1      2 -50000.0  1.010500e+08
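One possible explanation for the differing intercepts, offered as an assumption rather than something stated in the question: np.polyfit reports the fitted value at year 0, while the expected 300000 and 400000 are what the same lines give at year 2013, i.e. when years are counted from the year before the first observation. Here is a minimal sketch of that idea; `fit_trends` and its `x_origin` parameter are hypothetical names introduced only for illustration, and a plain loop over the groups stands in for groupby/apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 2, 'res_f_r': 300000},
    {'RegnskabsAar': 2016, 'SysNr': 2, 'res_f_r': 250000},
])


def fit_trends(frame, x_col, y_col, group_col, x_origin=0):
    """Fit y = slope * (x - x_origin) + intercept separately per group."""
    rows = []
    for key, g in frame.groupby(group_col):
        # Degree-1 polyfit returns [slope, intercept].
        slope, intercept = np.polyfit(g[x_col] - x_origin, g[y_col], 1)
        rows.append({group_col: key, 'intercept': intercept, 'slope': slope})
    return pd.DataFrame(rows)


# Assumption: the question measures time from 2013 (so 2014 -> 1, 2015 -> 2, ...).
trends = fit_trends(df, 'RegnskabsAar', 'res_f_r', 'SysNr', x_origin=2013)
print(trends.round(0))
```

Shifting the years this way leaves the slopes unchanged and also keeps the fit better conditioned than regressing on raw four-digit years.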
Since you asked in a comment on the other answer how to handle missing years for some SysNr: just drop the NaNs to get a valid linear regression. Of course you could also fill them with the mean or similar, depending on what you want to achieve, but from my point of view that isn't very helpful.
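As a rough sketch of the "drop the NaNs" suggestion: the data below is made up for illustration (SysNr 2 is missing its 2015 value), and the loop is just one way to write it, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 2015 value for SysNr 2 is missing.
df_missing = pd.DataFrame({
    'RegnskabsAar': [2014, 2015, 2016, 2014, 2015, 2016],
    'SysNr': [1, 1, 1, 2, 2, 2],
    'res_f_r': [350000, 400000, 450000, 350000, np.nan, 250000],
})

# Drop rows without a value before fitting; polyfit cannot handle NaN.
clean = df_missing.dropna(subset=['res_f_r'])

rows = []
for sysnr, g in clean.groupby('SysNr'):
    slope, intercept = np.polyfit(g['RegnskabsAar'], g['res_f_r'], 1)
    rows.append({'SysNr': sysnr, 'slope': slope, 'intercept': intercept})

fits = pd.DataFrame(rows)
print(fits)
```

With only two remaining points the line passes exactly through both of them, so the fit is valid but carries no information about scatter.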
If an entity has data for only one year, you can't usefully apply a linear regression to it. But you can, if that fits your case (please provide more information on the data if needed), extrapolate the slope from the other entities to this one and then calculate the intercept. For that you must of course make some assumption about how the slope is distributed across entities (e.g. linear in SysNr, in which case the slope of SysNr 3 would be -150000.0).
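To make the arithmetic behind that -150000.0 concrete, here is a small sketch under the same "slopes are linear in SysNr" assumption. The single observation for SysNr 3 (year 2016, value 200000) is invented for illustration, and whether this assumption is sensible depends entirely on your data.

```python
import numpy as np

# Slopes already fitted for the entities that have enough data.
known_sysnr = np.array([1, 2])
known_slopes = np.array([50000.0, -50000.0])

# Hypothetical single observation for SysNr 3 (not from the question).
year_3, value_3 = 2016, 200000.0

# Assumption: the slope changes linearly with SysNr, so fit a line
# through (SysNr, slope) and evaluate it at SysNr 3.
a, b = np.polyfit(known_sysnr, known_slopes, 1)
slope_3 = a * 3 + b

# With the slope fixed, the intercept follows from the one known point:
# value = slope * year + intercept.
intercept_3 = value_3 - slope_3 * year_3

print(slope_3, intercept_3)
```

Using the mean or median slope of the other entities instead would be an equally simple alternative assumption.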