Pandas interpolate within a groupby

I've got a dataframe with the following information:

        filename    val1    val2
    t
    1   file1.csv   5       10
    2   file1.csv   NaN     NaN
    3   file1.csv   15      20
    6   file2.csv   NaN     NaN
    7   file2.csv   10      20
    8   file2.csv   12      15

I would like to interpolate the values in the dataframe based on the indices, but only within each file group.

To interpolate, I would normally do

df = df.interpolate(method="index")

And to group, I do

grouped = df.groupby("filename")

I would like the interpolated dataframe to look like this:

        filename    val1    val2
    t
    1   file1.csv   5       10
    2   file1.csv   10      15
    3   file1.csv   15      20
    6   file2.csv   NaN     NaN
    7   file2.csv   10      20
    8   file2.csv   12      15

The NaNs at t = 6 should remain, since they are the first items in the file2 group and there is no earlier value in that group to interpolate from.

I suspect I need to use "apply", but haven't been able to figure out exactly how...

    grouped.apply(interp1d)
    ...
    TypeError: __init__() takes at least 3 arguments (2 given)

Any help would be appreciated.

asked May 05 '16 17:05

R. W.

2 Answers

    >>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
        filename  val1  val2
    t
    1  file1.csv     5    10
    2  file1.csv    10    15
    3  file1.csv    15    20
    6  file2.csv   NaN   NaN
    7  file2.csv    10    20
    8  file2.csv    12    15
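For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the frame from the question and applies the same groupby approach. Two details are my own assumptions rather than part of the one-liner above: I restrict the interpolation to the numeric columns (the `value_cols` name is just for illustration), and I pass `group_keys=False` so that `t` stays the only index level on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; "t" is the index.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# Assumption: only these columns need filling; the string column is left alone.
value_cols = ["val1", "val2"]

# group_keys=False keeps "t" as the only index level in the result.
interpolated = (
    df.groupby("filename", group_keys=False)[value_cols]
      .apply(lambda group: group.interpolate(method="index"))
)

# Write the filled values back; the assignment aligns on the "t" index.
result = df.copy()
result[value_cols] = interpolated
print(result)
```

The row at t = 6 keeps its NaNs because interpolation only fills forward from an existing value by default, and file2.csv has nothing before t = 6.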

answered Oct 09 '22 22:10

Alexander

I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:

    import numpy as np
    import pandas as pd

    np.random.seed(500)
    test_df = pd.DataFrame({
        'a': np.random.randint(low=0, high=1000, size=10000),
        'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
    })

Tests:

%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)

Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)

Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df.groupby('a').apply(lambda group: group.interpolate())

Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df.interpolate()

Output: 918 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You will still see a significant increase in run-time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.
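Applied to the data from the question, the `transform` approach might look like the sketch below. The lambda is only there to pass `method="index"`, and the column selection (`value_cols`, a name I made up for illustration) is an assumption on my part. I have also included the ungrouped call for comparison, since it is worth seeing why the fast version is not a drop-in replacement: it interpolates straight across the file boundary.

```python
import numpy as np
import pandas as pd

# Same frame as in the question; "t" is the index.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# Assumption: only these columns need filling.
value_cols = ["val1", "val2"]

# Per-group interpolation: transform returns a frame indexed like df.
per_group = df.groupby("filename")[value_cols].transform(
    lambda values: values.interpolate(method="index")
)

# Ungrouped interpolation for comparison: fast, but it ignores file boundaries.
ungrouped = df[value_cols].interpolate(method="index")

print(per_group.loc[6])  # still NaN: nothing earlier in file2.csv
print(ungrouped.loc[6])  # filled using file1.csv's last row and file2.csv's next row
```

So the extra run time of the grouped version is the price of keeping each file's values separate.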

answered Oct 09 '22 21:10

PMende
