Filter data to get only first day of the month rows

Tags:

pandas

python-2.7

I have a dataset of daily data, running from 1972 to 2013, and I need only the row for the first available day of each month. For example, I would need index 20, date 2013-12-02, value 0.1555 to be extracted. The problem is that the first available day differs from month to month, so I cannot use a fixed step such as relativedelta(months=1). How would I go about extracting these values from my dataset?

Is there a command similar to the one I found in another post for R?

R - XTS: Get the first dates and values for each month from a daily time series with missing rows

17 2013-12-05 0.1621
18 2013-12-04 0.1698
19 2013-12-03 0.1516
20 2013-12-02 0.1555
21 2013-11-29 0.1480
22 2013-11-27 0.1487
23 2013-11-26 0.1648


asked Sep 11 '14 21:09

tadalendas

1 Answer

I would group by month and then take the zeroth row of each group with nth.

First set the date column as the index (I think this is necessary):

In [11]: df1 = df.set_index('date')

In [12]: df1
Out[12]:
             n     val
date
2013-12-05  17  0.1621
2013-12-04  18  0.1698
2013-12-03  19  0.1516
2013-12-02  20  0.1555
2013-11-29  21  0.1480
2013-11-27  22  0.1487
2013-11-26  23  0.1648

Next, sort the index so that the first row of each group is the earliest date in that month (note: this doesn't appear to be necessary for nth, but I think that's actually a bug!):

In [13]: df1.sort_index(inplace=True)

In [14]: df1.groupby(pd.TimeGrouper('M')).nth(0)
Out[14]:
             n     val
date
2013-11-26  23  0.1648
2013-12-02  20  0.1555
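In more recent pandas releases pd.TimeGrouper is no longer available, and pd.Grouper(freq=...) takes its place. Below is a minimal sketch of the same step, with the sample frame rebuilt by hand. Two things are my substitutions rather than part of the answer: the 'MS' (month start) alias, and head(1) in place of nth(0), both chosen on the assumption that they behave consistently across pandas versions.

```python
import pandas as pd

# Sample rows from the question; column names n/date/val follow the answer.
df = pd.DataFrame({
    "n": [17, 18, 19, 20, 21, 22, 23],
    "date": pd.to_datetime([
        "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
        "2013-11-29", "2013-11-27", "2013-11-26",
    ]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
})

# Index by date and sort ascending so each month's first row is its earliest date.
df1 = df.set_index("date").sort_index()

# pd.Grouper(freq=...) stands in for pd.TimeGrouper; head(1) keeps the first
# row of each monthly group together with its original date label.
first_rows = df1.groupby(pd.Grouper(freq="MS")).head(1)
print(first_rows)
```

Because head(1) only filters rows, the index still shows the actual dates (2013-11-26 and 2013-12-02) rather than a month label.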

Another option is to resample and take the first entry:

In [15]: df1.resample('M', 'first')
Out[15]:
             n     val
date
2013-11-30  23  0.1648
2013-12-31  20  0.1555
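Later pandas versions dropped the positional how argument of resample in favour of method chaining. A sketch of the same idea in that style, again on a hand-built copy of the sample data; the 'MS' alias is my assumption and labels each bin by the month's first calendar day instead of its last, so the labels differ from the output above.

```python
import pandas as pd

# Sample data from the question, indexed by date and sorted ascending.
df1 = pd.DataFrame(
    {
        "n": [17, 18, 19, 20, 21, 22, 23],
        "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
    },
    index=pd.DatetimeIndex(
        [
            "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
            "2013-11-29", "2013-11-27", "2013-11-26",
        ],
        name="date",
    ),
).sort_index()

# One bin per calendar month; first() takes the first non-null value per column.
monthly_first = df1.resample("MS").first()
print(monthly_first)
```

As in the original resample output, the index holds bin labels, not the dates the values came from, so this form suits cases where only the values matter.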

Thinking about this some more, you can do it much more simply by extracting the month and grouping by that:

In [21]: pd.DatetimeIndex(df.date).to_period('M')
Out[21]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-12, ..., 2013-11]
Length: 7, Freq: M

In [22]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(0)
Out[22]:
    n       date     val
0  17 2013-12-05  0.1621
4  21 2013-11-29  0.1480

This time the sort order of df.date is (correctly) relevant: the frame is in descending date order, so nth(0) returns the latest date in each month. If you know it's in descending date order, you can use nth(-1) instead:

In [23]: df.groupby(pd.DatetimeIndex(df.date).to_period('M')).nth(-1)
Out[23]:
    n       date     val
3  20 2013-12-02  0.1555
6  23 2013-11-26  0.1648

If this isn't guaranteed, sort by the date column first: df.sort('date') (df.sort_values('date') in newer pandas versions, where DataFrame.sort no longer exists).
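Putting the sort and the month grouping together, a minimal end-to-end sketch might look like the following. It assumes a pandas version that provides sort_values and the .dt accessor, and it uses head(1) rather than nth(0) (my substitution) so that the original rows and index come back unchanged.

```python
import pandas as pd

# Sample rows from the question, in the same descending date order.
df = pd.DataFrame({
    "n": [17, 18, 19, 20, 21, 22, 23],
    "date": pd.to_datetime([
        "2013-12-05", "2013-12-04", "2013-12-03", "2013-12-02",
        "2013-11-29", "2013-11-27", "2013-11-26",
    ]),
    "val": [0.1621, 0.1698, 0.1516, 0.1555, 0.1480, 0.1487, 0.1648],
})

# Sort ascending so the first row of each month is its earliest date.
ordered = df.sort_values("date")

# Map each date to its calendar month (a Period) and keep one row per month.
month = ordered["date"].dt.to_period("M")
first_rows = ordered.groupby(month).head(1)
print(first_rows)
```

With the sample data this keeps the rows with n equal to 23 and 20, which includes the 2013-12-02 row the question asks for.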


answered Oct 31 '22 00:10

Andy Hayden
