
Pandas interpolate within a groupby

I've got a dataframe with the following information:

        filename    val1    val2
    t
    1   file1.csv   5       10
    2   file1.csv   NaN     NaN
    3   file1.csv   15      20
    6   file2.csv   NaN     NaN
    7   file2.csv   10      20
    8   file2.csv   12      15

I would like to interpolate the values in the dataframe based on the indices, but only within each file group.

To interpolate, I would normally do

df = df.interpolate(method="index") 

And to group, I do

grouped = df.groupby("filename") 

I would like the interpolated dataframe to look like this:

        filename    val1    val2
    t
    1   file1.csv   5       10
    2   file1.csv   10      15
    3   file1.csv   15      20
    6   file2.csv   NaN     NaN
    7   file2.csv   10      20
    8   file2.csv   12      15

Where the NaN's are still present at t = 6 since they are the first items in the file2 group.

I suspect I need to use "apply", but haven't been able to figure out exactly how...

grouped.apply(interp1d)
...
TypeError: __init__() takes at least 3 arguments (2 given)

Any help would be appreciated.

R. W. asked May 05 '16 17:05



2 Answers

>>> df.groupby('filename').apply(lambda group: group.interpolate(method='index'))
    filename  val1  val2
t
1  file1.csv     5    10
2  file1.csv    10    15
3  file1.csv    15    20
6  file2.csv   NaN   NaN
7  file2.csv    10    20
8  file2.csv    12    15
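For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the question's data and applies the same per-group interpolation. Two details are my own assumptions rather than part of the answer above: the frame is reconstructed by hand from the printed table, and I pass `group_keys=False` and select only the numeric columns (`value_cols`, a name I made up), because newer pandas releases may otherwise add the group key to the index or complain about the string column.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question; "t" is the index.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)

# Hypothetical helper name: the columns that should be interpolated.
value_cols = ["val1", "val2"]

# Interpolate each file's rows separately, using the index values as x.
interpolated = df.groupby("filename", group_keys=False)[value_cols].apply(
    lambda group: group.interpolate(method="index")
)

# Write the filled values back next to the untouched filename column.
result = df.copy()
result[value_cols] = interpolated
print(result)
```

The leading NaNs of file2 at t = 6 should stay in place, since by default interpolate only fills forward from the first valid value in each group.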
Alexander answered Oct 09 '22 22:10


I ran into this as well. Instead of using apply, you can use transform, which will reduce your run time by more than 25% if you have on the order of 1000 groups:

import numpy as np
import pandas as pd

np.random.seed(500)
test_df = pd.DataFrame({
    'a': np.random.randint(low=0, high=1000, size=10000),
    'b': np.random.choice([1, 2, 4, 7, np.nan], size=10000, p=([0.2475]*4 + [0.01]))
})

Tests:

%timeit test_df.groupby('a').transform(pd.DataFrame.interpolate) 

Output: 566 ms ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df.groupby('a').apply(pd.DataFrame.interpolate) 

Output: 788 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df.groupby('a').apply(lambda group: group.interpolate()) 

Output: 787 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_df.interpolate() 

Output: 918 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You will still see a significant increase in run-time compared to a fully vectorized call to interpolate on the full DataFrame, but I don't think you can do much better in pandas.
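To see why the fast whole-frame call is not a drop-in replacement for the grouped one, here is a small sketch on the question's data (rebuilt by hand, so treat the frame and the names `cols`, `whole`, and `per_group` as my assumptions). It uses `transform` with a lambda so that `method="index"` can be passed through, and compares the result with interpolating across the whole frame.

```python
import numpy as np
import pandas as pd

# Example data reconstructed from the question; "t" is the index.
df = pd.DataFrame(
    {
        "filename": ["file1.csv"] * 3 + ["file2.csv"] * 3,
        "val1": [5, np.nan, 15, np.nan, 10, 12],
        "val2": [10, np.nan, 20, np.nan, 20, 15],
    },
    index=pd.Index([1, 2, 3, 6, 7, 8], name="t"),
)
cols = ["val1", "val2"]

# Whole-frame interpolation ignores file boundaries, so the gap at
# t = 6 is filled from file1's last row and file2's next row.
whole = df[cols].interpolate(method="index")

# Group-wise interpolation via transform keeps each file separate.
per_group = df.groupby("filename")[cols].transform(
    lambda values: values.interpolate(method="index")
)

print(whole.loc[6])      # filled using values that straddle two files
print(per_group.loc[6])  # still NaN: first row of the file2 group
```

So the extra run time of the grouped version buys the guarantee that values never leak from one group into the next.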

PMende answered Oct 09 '22 21:10