I need some direction on grouping a pandas DataFrame object by year or month and getting back a new DataFrame object with a new index.
Here is my code so far. groupby works as intended.
Load the data from a .csv file and parse 'Date' as dates (historical stock quotes from finance.yahoo.com):
In [23]: import pandas as pd
file = pd.read_csv("sdf.de.csv", parse_dates=['Date'])
file.head(2)
Out[23]:
Date Open High Low Close Volume Adj Close
0 2016-02-16 18.650 18.70 17.940 18.16 1720800 17.0600
1 2016-02-15 18.295 18.64 18.065 18.50 1463500 17.3794
Sort file by 'Date' in ascending order and set the index to 'Date':
In [24]: daily = file.sort_values(by='Date').set_index('Date')
daily.head()
Out[24]:
Open High Low Close Volume Adj Close
Date
2000-01-03 14.20 14.50 14.15 14.40 277400 2.7916
2000-01-04 14.29 14.30 13.90 14.15 109200 2.7431
Grouping by month
I would then call apply() on the groups to condense the data for each group, e.g. find the highest High value for the year/month or sum() the Volume values. This step is omitted in this example.
In [39]: monthly = daily.groupby(lambda x: (x.year, x.month))
monthly.first()
Out[39]:
Open High Low Close Volume Adj Close
(2000, 1) 14.200 14.500 14.150 14.400 277400 2.7916
(2000, 2) 13.900 14.390 13.900 14.250 287200 2.7625
... ... ... ... ... ... ...
(2016, 1) 23.620 23.620 23.620 23.620 0 22.1893
(2016, 2) 19.575 19.630 19.140 19.450 1783000 18.2719
This works, but it gives me a DataFrame object with tuples as the index.
The desired result, in this case when grouping by month, would be a completely new DataFrame object whose Date index is a new DatetimeIndex in the form %Y-%m, or just %Y if grouped by year.
Out[39]:
Open High Low Close Volume Adj Close
Date
2000-01 14.200 14.500 14.150 14.400 277400 2.7916
2000-02 13.900 14.390 13.900 14.250 287200 2.7625
... ... ... ... ... ... ...
2016-01 23.620 23.620 23.620 23.620 0 22.1893
2016-02 19.575 19.630 19.140 19.450 1783000 18.2719
I'm thankful for any directions.
You can use groupby with daily.index.year and daily.index.month, or convert the index with to_period and then group by the index:
print(daily)
Open High Low Close Volume Adj Close
Date
2000-01-01 14.200 14.50 14.15 14.40 277400 2.7916
2000-02-01 13.900 14.39 13.90 14.25 287200 2.7625
2016-01-01 23.620 23.62 23.62 23.62 0 22.1893
2016-02-01 19.575 19.63 19.14 19.45 1783000 18.2719
print(daily.groupby([daily.index.year, daily.index.month]).first())
Open High Low Close Volume Adj Close
2000 1 14.200 14.50 14.15 14.40 277400 2.7916
2 13.900 14.39 13.90 14.25 287200 2.7625
2016 1 23.620 23.62 23.62 23.62 0 22.1893
2 19.575 19.63 19.14 19.45 1783000 18.2719
daily.index = daily.index.to_period('M')
print(daily.groupby(daily.index).first())
Open High Low Close Volume Adj Close
Date
2000-01 14.200 14.50 14.15 14.40 277400 2.7916
2000-02 13.900 14.39 13.90 14.25 287200 2.7625
2016-01 23.620 23.62 23.62 23.62 0 22.1893
2016-02 19.575 19.63 19.14 19.45 1783000 18.2719
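As a rough sketch of how the to_period approach could be combined with the condensing step mentioned in the question (highest High, summed Volume), something like the following may work. The sample values and the names monthly, monthly_ts and yearly are made up for illustration, and the choice of aggregations is an assumption. Passing the period index straight to groupby leaves the index of daily untouched, and to_timestamp() converts the resulting PeriodIndex back to a DatetimeIndex (month start) if that is what you need.

```python
import pandas as pd

# Hypothetical sample standing in for the daily quotes (values are made up).
daily = pd.DataFrame(
    {
        "High": [14.50, 14.30, 14.39, 14.10],
        "Low": [14.15, 13.90, 13.90, 13.80],
        "Volume": [277400, 109200, 287200, 150000],
    },
    index=pd.DatetimeIndex(
        ["2000-01-03", "2000-01-04", "2000-02-01", "2000-02-02"], name="Date"
    ),
)

# Group by monthly period without overwriting daily's own index,
# and condense each group with one aggregation per column.
monthly = daily.groupby(daily.index.to_period("M")).agg(
    {"High": "max", "Low": "min", "Volume": "sum"}
)

# Optional: turn the PeriodIndex into a DatetimeIndex (start of each month).
monthly_ts = monthly.to_timestamp()

# Grouping by year: use the year attribute of the DatetimeIndex as the key.
yearly = daily.groupby(daily.index.year).agg(
    {"High": "max", "Low": "min", "Volume": "sum"}
)

print(monthly)
print(monthly_ts)
print(yearly)
```

Note that a PeriodIndex prints as 2000-01, 2000-02, while the converted DatetimeIndex prints full dates such as 2000-01-01.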
You can use list comprehensions to extract the year and month attributes from your timestamps and then group on those.
>>> df.groupby([[d.year for d in df.Date], [d.month for d in df.Date]]).first()
Date Open High Low Close Volume Adj_Close
2000 1 2000-01-01 14.200 14.50 14.15 14.40 277400 2.7916
2 2000-02-01 13.900 14.39 13.90 14.25 287200 2.7625
2016 1 2016-01-01 23.620 23.62 23.62 23.62 0 22.1893
2 2016-02-01 19.575 19.63 19.14 19.45 1783000 18.2719
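If 'Date' is a regular datetime64 column, as in the example above, the same grouping keys can probably be built without Python-level loops by using the .dt accessor. This is only a sketch: the sample values are made up, and the level names "year" and "month" are chosen here for illustration.

```python
import pandas as pd

# Hypothetical frame with 'Date' as a regular column (values are made up).
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-02-01"]),
        "Close": [14.40, 14.15, 14.25],
    }
)

# .dt exposes the datetime fields of the column as Series;
# rename them so the resulting index levels get distinct names.
year = df["Date"].dt.year.rename("year")
month = df["Date"].dt.month.rename("month")

# First Close value of each (year, month) group.
by_month = df.groupby([year, month])["Close"].first()
print(by_month)
```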