I have a DataFrame 'RPT' indexed by (STK_ID, RPT_Date) that contains the accumulated sales of stocks for each quarter:
sales
STK_ID RPT_Date
000876 20060331 798627000
20060630 1656110000
20060930 2719700000
20061231 3573660000
20070331 878415000
20070630 2024660000
20070930 3352630000
20071231 4791770000
600141 20060331 270912000
20060630 658981000
20060930 1010270000
20061231 1591500000
20070331 319602000
20070630 790670000
20070930 1250530000
20071231 1711240000
I want to calculate the sales of each single quarter using 'groupby' on STK_ID and RPT_Yr, something like: RPT.groupby('STK_ID','RPT_Yr')['sales'].transform(lambda x: x-x.shift(1))
How can I do that?
Suppose I can get the year with lambda x : datetime.strptime(x, '%Y%m%d').year
Assuming here that RPT_Date is a string: is there any reason not to use datetime?
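To illustrate the datetime suggestion, here is a minimal sketch that parses the 'RPT_Date' level and reads the year from it. The variable names `rpt`, `dates` and `years` are only for illustration, and the sample is a small subset of the data in the question.

```python
import pandas as pd

# Small sample in the same shape as RPT (values copied from the question).
idx = pd.MultiIndex.from_tuples(
    [("000876", "20060331"), ("000876", "20060630"),
     ("000876", "20070331"), ("000876", "20070630")],
    names=["STK_ID", "RPT_Date"],
)
rpt = pd.DataFrame(
    {"sales": [798627000, 1656110000, 878415000, 2024660000]}, index=idx
)

# Parse the string level into datetimes; the year is then a plain attribute.
dates = pd.to_datetime(rpt.index.get_level_values("RPT_Date"), format="%Y%m%d")
years = dates.year
print(list(years))
```

With real datetimes there is no need for a custom `strptime` lambda, and the quarter is available in the same way through `dates.quarter`.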
It is possible to group by using functions, but only on an index that is not a MultiIndex. You can work around this by resetting the index and setting 'RPT_Date' as the index, so that the year can be extracted from it (note: pandas toggles between object and int as the dtype for 'RPT_Date', hence the str(x) in the lambda).
In [134]: from datetime import datetime
In [135]: year = lambda x : datetime.strptime(str(x), '%Y%m%d').year
In [136]: grouped = RPT.reset_index().set_index('RPT_Date').groupby(['STK_ID', year])
In [137]: for key, df in grouped:
   .....:     print(key)
   .....:     print(df)
   .....:
(876, 2006)
STK_ID sales
RPT_Date
20060331 876 798627000
20060630 876 1656110000
20060930 876 2719700000
20061231 876 3573660000
(876, 2007)
STK_ID sales
RPT_Date
20070331 876 878415000
20070630 876 2024660000
20070930 876 3352630000
20071231 876 4791770000
(600141, 2006)
STK_ID sales
RPT_Date
20060331 600141 270912000
20060630 600141 658981000
20060930 600141 1010270000
20061231 600141 1591500000
(600141, 2007)
STK_ID sales
RPT_Date
20070331 600141 319602000
20070630 600141 790670000
20070930 600141 1250530000
20071231 600141 1711240000
Another option is to use a temporary column:
In [153]: RPT_tmp = RPT.reset_index()
In [154]: RPT_tmp['year'] = RPT_tmp['RPT_Date'].apply(year)
In [155]: grouped = RPT_tmp.groupby(['STK_ID', 'year'])
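From there, the single-quarter sales could be obtained roughly as follows. This is a sketch under a few assumptions: the column name `q_sales` is made up for illustration, the sample is a subset of the data in the question, and the first quarter of each year is taken to keep its accumulated value since there is nothing to subtract from it.

```python
from datetime import datetime

import pandas as pd

# Sample data copied from the question (one stock, partial years).
idx = pd.MultiIndex.from_tuples(
    [("000876", "20060331"), ("000876", "20060630"), ("000876", "20060930"),
     ("000876", "20070331"), ("000876", "20070630")],
    names=["STK_ID", "RPT_Date"],
)
RPT = pd.DataFrame(
    {"sales": [798627000, 1656110000, 2719700000, 878415000, 2024660000]},
    index=idx,
)

year = lambda x: datetime.strptime(str(x), "%Y%m%d").year

RPT_tmp = RPT.reset_index()
RPT_tmp["year"] = RPT_tmp["RPT_Date"].apply(year)

# diff() within each (stock, year) group gives the change from the previous
# quarter; the first quarter of a year has no predecessor and comes out as NaN,
# so it is filled with its own accumulated value.
per_quarter = RPT_tmp.groupby(["STK_ID", "year"])["sales"].diff()
RPT_tmp["q_sales"] = per_quarter.fillna(RPT_tmp["sales"]).astype("int64")
print(RPT_tmp)
```

`diff()` does the same job as `x - x.shift(1)` from the question, so either form should work inside the groupby.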
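EDIT: Reorganising your frame makes this much easier. With the index split into (STK_ID, RPT_Year, RPT_Quarter) you can group on the first two levels directly. (sale_per_q is not shown here; judging by the output, it subtracts the previous quarter's accumulated sales within each group and leaves the first quarter unchanged.)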
In [48]: RPT
Out[48]:
sales
STK_ID RPT_Year RPT_Quarter
876 2006 0 798627000
1 1656110000
2 2719700000
3 3573660000
2007 0 878415000
1 2024660000
2 3352630000
3 4791770000
600141 2006 0 270912000
1 658981000
2 1010270000
3 1591500000
2007 0 319602000
1 790670000
2 1250530000
3 1711240000
In [49]: RPT.groupby(level=['STK_ID', 'RPT_Year'])['sales'].apply(sale_per_q)
Out[49]:
STK_ID RPT_Year RPT_Quarter
876 2006 0 798627000
1 857483000
2 1063590000
3 853960000
2007 0 878415000
1 1146245000
2 1327970000
3 1439140000
600141 2006 0 270912000
1 388069000
2 351289000
3 581230000
2007 0 319602000
1 471068000
2 459860000
3 460710000
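Since `sale_per_q` is not defined above, here is one possible definition that reproduces the numbers shown. It is a guess at the helper, not necessarily the one originally used, and the sample frame is a subset of the data. The sketch uses `transform` rather than `apply`, because in newer pandas versions `apply` may prepend the group keys to the index of the result.

```python
import pandas as pd

# Reorganised frame: (STK_ID, RPT_Year, RPT_Quarter) as index levels.
idx = pd.MultiIndex.from_tuples(
    [(876, 2006, 0), (876, 2006, 1), (876, 2006, 2), (876, 2006, 3),
     (876, 2007, 0), (876, 2007, 1)],
    names=["STK_ID", "RPT_Year", "RPT_Quarter"],
)
RPT = pd.DataFrame(
    {"sales": [798627000, 1656110000, 2719700000, 3573660000,
               878415000, 2024660000]},
    index=idx,
)

def sale_per_q(x):
    # Hypothetical helper: subtract the previous quarter's accumulated sales;
    # the first quarter of each group keeps its own value.
    return x.diff().fillna(x)

result = (
    RPT.groupby(level=["STK_ID", "RPT_Year"])["sales"]
    .transform(sale_per_q)
    .astype("int64")
)
print(result)
```

The result keeps the original three-level index, so it can be assigned back as a new column of `RPT` if needed.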