Efficiently re-indexing one level with "forward-fill" in a multi-index dataframe

Tags:

python

pandas

Consider the following DataFrame:

                          value
item_uid   created_at          
0S0099v8iI 2015-03-25  10652.79
0F01ddgkRa 2015-03-25   1414.71
0F02BZeTr6 2015-03-20  51505.22
           2015-03-23  51837.97
           2015-03-24  51578.63
           2015-03-25       NaN
           2015-03-26       NaN
           2015-03-27  50893.42
0F02BcIzNo 2015-03-17   1230.00
           2015-03-23   1130.00
0F02F4gAMs 2015-03-25   1855.96
0F02Vwd6Ou 2015-03-19   5709.33
0F04OlAs0R 2015-03-18    321.44
0F05GInfPa 2015-03-16    664.68
0F05PQARFJ 2015-03-18   1074.31
           2015-03-26   1098.31
0F06LFhBCK 2015-03-18    211.49
0F06ryso80 2015-03-16     13.73
           2015-03-20     12.00
0F07gg7Oth 2015-03-19   2325.70

I need to sample the full dataframe between two dates start_date and end_date on every date between them, propagating the last seen value. The sampling should be done within each item_uid independently/separately.

For example, if we were to sample between 2015-03-20 and 2015-03-29 for 0F02BZeTr6, we should get:

0F02BZeTr6 2015-03-20  51505.22
           2015-03-21  51505.22
           2015-03-22  51505.22
           2015-03-23  51837.97
           2015-03-24  51578.63
           2015-03-25  51578.63
           2015-03-26  51578.63
           2015-03-27  50893.42
           2015-03-28  50893.42
           2015-03-29  50893.42

Note that I am forward filling both NaN and missing entries in the dataframe.

This other question addresses a similar problem, but only with one group (i.e. one level). This question instead asks how to do the same but within each group (item_uid) separately. While I could split the input dataframe and iterate through each of the groups (each of the item_uid), and then stitch together the result, I am wondering if there is anything more efficient.

When I do the following (see this PR):

dates = pd.date_range(start=start_date, end=end_date)
df.groupby(level='item_uid').apply(lambda x: x.reindex(dates, method='ffill'))

I get:

TypeError: Fill method not supported if level passed

asked Mar 30 '15 21:03

Amelio Vazquez-Reina

1 Answer

You have a couple of options; the easiest, IMO, is to simply unstack the first level and then ffill. I think this makes it much clearer what's going on than a groupby/resample solution (I suspect it will also be faster, depending on the data):

In [11]: df1['value'].unstack(0)
Out[11]:
item_uid    0F01ddgkRa  0F02BZeTr6  0F02BcIzNo  0F02F4gAMs  0F02Vwd6Ou  0F04OlAs0R  0F05GInfPa  0F05PQARFJ  0F06LFhBCK  0F06ryso80  0F07gg7Oth  0S0099v8iI
created_at
2015-03-16         NaN         NaN         NaN         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-17         NaN         NaN        1230         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-18         NaN         NaN         NaN         NaN         NaN      321.44         NaN     1074.31      211.49         NaN         NaN         NaN
2015-03-19         NaN         NaN         NaN         NaN     5709.33         NaN         NaN         NaN         NaN         NaN      2325.7         NaN
2015-03-20         NaN    51505.22         NaN         NaN         NaN         NaN         NaN         NaN         NaN       12.00         NaN         NaN
2015-03-23         NaN    51837.97        1130         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-24         NaN    51578.63         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-25     1414.71         NaN         NaN     1855.96         NaN         NaN         NaN         NaN         NaN         NaN         NaN    10652.79
2015-03-26         NaN         NaN         NaN         NaN         NaN         NaN         NaN     1098.31         NaN         NaN         NaN         NaN
2015-03-27         NaN    50893.42         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
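The session above never shows how df1 was built. A minimal sketch that rebuilds it from the rows in the question, assuming created_at should be parsed as datetimes (the names rows and wide are only for illustration):

```python
import numpy as np
import pandas as pd

# Rows copied from the question; `df1` follows the naming in the answer's session.
rows = [
    ("0S0099v8iI", "2015-03-25", 10652.79),
    ("0F01ddgkRa", "2015-03-25", 1414.71),
    ("0F02BZeTr6", "2015-03-20", 51505.22),
    ("0F02BZeTr6", "2015-03-23", 51837.97),
    ("0F02BZeTr6", "2015-03-24", 51578.63),
    ("0F02BZeTr6", "2015-03-25", np.nan),
    ("0F02BZeTr6", "2015-03-26", np.nan),
    ("0F02BZeTr6", "2015-03-27", 50893.42),
    ("0F02BcIzNo", "2015-03-17", 1230.00),
    ("0F02BcIzNo", "2015-03-23", 1130.00),
    ("0F02F4gAMs", "2015-03-25", 1855.96),
    ("0F02Vwd6Ou", "2015-03-19", 5709.33),
    ("0F04OlAs0R", "2015-03-18", 321.44),
    ("0F05GInfPa", "2015-03-16", 664.68),
    ("0F05PQARFJ", "2015-03-18", 1074.31),
    ("0F05PQARFJ", "2015-03-26", 1098.31),
    ("0F06LFhBCK", "2015-03-18", 211.49),
    ("0F06ryso80", "2015-03-16", 13.73),
    ("0F06ryso80", "2015-03-20", 12.00),
    ("0F07gg7Oth", "2015-03-19", 2325.70),
]
df1 = pd.DataFrame(rows, columns=["item_uid", "created_at", "value"])
df1["created_at"] = pd.to_datetime(df1["created_at"])  # real datetimes, not strings
df1 = df1.set_index(["item_uid", "created_at"])

# Move item_uid (level 0) into the columns: one row per date, one column per item.
wide = df1["value"].unstack(0)
print(wide.shape)
```

With these rows, wide has one row for each distinct date that appears anywhere in the data and one column per item_uid, which is the layout the rest of the answer works on.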

If you're missing some dates, you have to reindex. asfreq does this, assuming the start and end dates are present; otherwise you can do it manually, e.g. with pd.date_range:

In [12]: df1['value'].unstack(0).asfreq('D')
Out[12]:
item_uid    0F01ddgkRa  0F02BZeTr6  0F02BcIzNo  0F02F4gAMs  0F02Vwd6Ou  0F04OlAs0R  0F05GInfPa  0F05PQARFJ  0F06LFhBCK  0F06ryso80  0F07gg7Oth  0S0099v8iI
2015-03-16         NaN         NaN         NaN         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-17         NaN         NaN        1230         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-18         NaN         NaN         NaN         NaN         NaN      321.44         NaN     1074.31      211.49         NaN         NaN         NaN
2015-03-19         NaN         NaN         NaN         NaN     5709.33         NaN         NaN         NaN         NaN         NaN      2325.7         NaN
2015-03-20         NaN    51505.22         NaN         NaN         NaN         NaN         NaN         NaN         NaN       12.00         NaN         NaN
2015-03-21         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-22         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-23         NaN    51837.97        1130         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-24         NaN    51578.63         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2015-03-25     1414.71         NaN         NaN     1855.96         NaN         NaN         NaN         NaN         NaN         NaN         NaN    10652.79
2015-03-26         NaN         NaN         NaN         NaN         NaN         NaN         NaN     1098.31         NaN         NaN         NaN         NaN
2015-03-27         NaN    50893.42         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN

Note: asfreq drops the name of the index (which is most likely a bug!)
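When start_date or end_date is not itself in the data, the manual route with pd.date_range might look like the sketch below. It uses a two-item subset of the question's data, and it forward-fills before reindexing so that a value seen before start_date is still carried into the window; that ordering is my assumption about what "last seen value" should mean here.

```python
import numpy as np
import pandas as pd

# Two items taken from the question's data (a subset, for brevity).
idx = pd.MultiIndex.from_arrays(
    [
        ["0F02BZeTr6"] * 6 + ["0F06ryso80"] * 2,
        pd.to_datetime(
            ["2015-03-20", "2015-03-23", "2015-03-24", "2015-03-25",
             "2015-03-26", "2015-03-27", "2015-03-16", "2015-03-20"]
        ),
    ],
    names=["item_uid", "created_at"],
)
df1 = pd.DataFrame(
    {"value": [51505.22, 51837.97, 51578.63, np.nan, np.nan, 50893.42, 13.73, 12.00]},
    index=idx,
)

start_date, end_date = "2015-03-20", "2015-03-29"
# Naming the range keeps the index name on the result.
dates = pd.date_range(start=start_date, end=end_date, name="created_at")

wide = df1["value"].unstack(0)
# ffill() fills NaN already present in the data; reindex(..., method="ffill")
# then picks, for every requested date, the last row at or before that date.
filled = wide.ffill().reindex(index=dates, method="ffill")
print(filled["0F02BZeTr6"])
```

For 0F02BZeTr6 this reproduces the ten values the question asks for, including 2015-03-28 and 2015-03-29, which lie past the last observation.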

Now you can ffill:

In [13]: df1['value'].unstack(0).asfreq('D').ffill()
Out[13]:
item_uid    0F01ddgkRa  0F02BZeTr6  0F02BcIzNo  0F02F4gAMs  0F02Vwd6Ou  0F04OlAs0R  0F05GInfPa  0F05PQARFJ  0F06LFhBCK  0F06ryso80  0F07gg7Oth  0S0099v8iI
2015-03-16         NaN         NaN         NaN         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-17         NaN         NaN        1230         NaN         NaN         NaN      664.68         NaN         NaN       13.73         NaN         NaN
2015-03-18         NaN         NaN        1230         NaN         NaN      321.44      664.68     1074.31      211.49       13.73         NaN         NaN
2015-03-19         NaN         NaN        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       13.73      2325.7         NaN
2015-03-20         NaN    51505.22        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-21         NaN    51505.22        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-22         NaN    51505.22        1230         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-23         NaN    51837.97        1130         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-24         NaN    51578.63        1130         NaN     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7         NaN
2015-03-25     1414.71    51578.63        1130     1855.96     5709.33      321.44      664.68     1074.31      211.49       12.00      2325.7    10652.79
2015-03-26     1414.71    51578.63        1130     1855.96     5709.33      321.44      664.68     1098.31      211.49       12.00      2325.7    10652.79
2015-03-27     1414.71    50893.42        1130     1855.96     5709.33      321.44      664.68     1098.31      211.49       12.00      2325.7    10652.79

and stack it back (Note: you can pass dropna=False to stack if you want to include the starting NaN):

In [14]: s = df1['value'].unstack(0).asfreq('D').ffill().stack()

Note: If the ordering of the index is important to you, you can switch/sort it:

In [15]: s.index = s.index.swaplevel(0, 1)

In [16]: s = s.sort_index()

In [17]: s.index.names = ['item_uid', 'created_at']  # as this is lost earlier

In [18]: s
Out[18]:
item_uid
0F01ddgkRa  2015-03-25     1414.71
            2015-03-26     1414.71
            2015-03-27     1414.71
0F02BZeTr6  2015-03-20    51505.22
            2015-03-21    51505.22
            2015-03-22    51505.22
            2015-03-23    51837.97
            2015-03-24    51578.63
            2015-03-25    51578.63
            2015-03-26    51578.63
            2015-03-27    50893.42
...
0S0099v8iI  2015-03-25    10652.79
            2015-03-26    10652.79
            2015-03-27    10652.79
Length: 100, dtype: float64
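The stack / swaplevel / sort_index steps can also be tried on their own. The sketch below starts from a small, already forward-filled frame whose values are taken from Out[13] for two items. As far as I know, how stack treats NaN has changed between pandas versions, so the explicit dropna() here is an assumption-driven way to get the same result either way; omit it to keep the leading NaN.

```python
import numpy as np
import pandas as pd

# Already forward-filled wide frame (values as in Out[13], two items only).
dates = pd.date_range("2015-03-23", "2015-03-27", name="created_at")
wide = pd.DataFrame(
    {
        "0F01ddgkRa": [np.nan, np.nan, 1414.71, 1414.71, 1414.71],
        "0F02BZeTr6": [51837.97, 51578.63, 51578.63, 51578.63, 50893.42],
    },
    index=dates,
)
wide.columns.name = "item_uid"

# Back to long form; drop the leading NaN explicitly.
s = wide.stack().dropna()

# Put item_uid first again and sort so each item's dates are contiguous.
s = s.swaplevel(0, 1).sort_index()
s.index.names = ["item_uid", "created_at"]
print(s)
```

The result is a Series indexed by (item_uid, created_at), with 0F01ddgkRa starting on 2015-03-25 because its earlier rows were NaN.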

Whether this is more efficient than a groupby/resample apply solution will depend on the data. For very sparse data (with lots of leading NaN, assuming you want to drop these), I suspect it won't be as fast. If the data is dense (or you want to keep the initial NaN), I suspect this solution will be faster.
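For comparison, a per-group version can be sketched as follows. This is not the code from the question or from any particular pandas recipe, just one way to write the "split, fill, stitch" idea: dropping the item_uid level inside each group leaves a plain DatetimeIndex, so reindex with method='ffill' is no longer applied to a MultiIndex. The data is a two-item subset of the question's, and pieces and per_group are illustrative names.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [
        ["0F02BZeTr6"] * 6 + ["0F06ryso80"] * 2,
        pd.to_datetime(
            ["2015-03-20", "2015-03-23", "2015-03-24", "2015-03-25",
             "2015-03-26", "2015-03-27", "2015-03-16", "2015-03-20"]
        ),
    ],
    names=["item_uid", "created_at"],
)
df1 = pd.DataFrame(
    {"value": [51505.22, 51837.97, 51578.63, np.nan, np.nan, 50893.42, 13.73, 12.00]},
    index=idx,
)
dates = pd.date_range("2015-03-20", "2015-03-29", name="created_at")

# One forward-filled, date-reindexed Series per item_uid.
pieces = {
    uid: grp.droplevel("item_uid").sort_index().ffill().reindex(dates, method="ffill")
    for uid, grp in df1["value"].groupby(level="item_uid")
}

# Stitch the pieces back together with item_uid as the outer index level.
per_group = pd.concat(pieces, names=["item_uid"])
print(per_group)
```

Timing both versions on your own data (for example with %timeit) is the only reliable way to see which is faster for a given sparsity.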

answered Oct 20 '22 20:10

Andy Hayden
