I've got a dataframe with the following information:
            filename  val1  val2
t
1          file1.csv     5    10
2          file1.csv   NaN   NaN
3          file1.csv    15    20
6          file2.csv   NaN   NaN
7          file2.csv    10    20
8          file2.csv    12    15
I would like to interpolate the values in the dataframe based on the indices, but only within each file group.
To interpolate, I would normally do
df = df.interpolate(method="index")
And to group, I do
grouped = df.groupby("filename")
I would like the interpolated dataframe to look like this:
            filename  val1  val2
t
1          file1.csv     5    10
2          file1.csv    10    15
3          file1.csv    15    20
6          file2.csv   NaN   NaN
7          file2.csv    10    20
8          file2.csv    12    15
Where the NaNs are still present at t = 6, since they are the first items in the file2 group.
I suspect I need to use "apply", but haven't been able to figure out exactly how...
grouped.apply(interp1d)
...
TypeError: __init__() takes at least 3 arguments (2 given)
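As an aside on that traceback: scipy's interp1d is constructed from both an x array and a y array, and it returns a callable rather than filling a frame in place, which is why passing a single group to it fails. A minimal standalone sketch of its intended use:

```python
import numpy as np
from scipy.interpolate import interp1d

x = np.array([1, 3])   # known index positions
y = np.array([5, 15])  # known values at those positions
f = interp1d(x, y)     # linear interpolation by default
f(2)                   # -> 10.0, the midpoint between 5 and 15
```

So interp1d is not a drop-in argument for apply; each group would need to be unpacked into x/y arrays first.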
Any help would be appreciated.
>>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
            filename  val1  val2
t
1          file1.csv     5    10
2          file1.csv    10    15
3          file1.csv    15    20
6          file2.csv   NaN   NaN
7          file2.csv    10    20
8          file2.csv    12    15
I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:
import numpy as np
import pandas as pd

np.random.seed(500)
test_df = pd.DataFrame({
    'a': np.random.randint(low=0, high=1000, size=10000),
    'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
})
Tests:
%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate)
Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate)
Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.groupby('a').apply(lambda group: group.interpolate())
Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_df.interpolate()
Output: 918 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You will still see a significant increase in run time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.
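Putting the pieces together, here is a self-contained sketch (column names and values taken from the question) that uses transform to interpolate within each file group. transform preserves the original index, so the result slots straight back into the frame, and the leading NaNs in file2 stay NaN because that group has no earlier value to interpolate from:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question, with t as the index.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# Interpolate the numeric columns within each filename group.
out = df.copy()
out[["val1", "val2"]] = df.groupby("filename")[["val1", "val2"]].transform(
    lambda s: s.interpolate(method="index")
)
```

After this, out.loc[2] holds the interpolated (10, 15) for file1, while out.loc[6] is still NaN.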