How to fillna in a pandas DataFrame based on a pattern, like dragging in Excel?

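I have a DataFrame whose empty rows should be filled by inferring the pattern in the preceding rows, the way Excel does when you drag the fill handle. For example, if a column is a continuous run of integers, it should be continued with the next numbers.

Is there a function in Python that does this?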

import numpy as np
import pandas as pd
d = { 'year': [2019,2020,2019,2020,np.nan,np.nan], 'cat1': [1,2,3,4,np.nan,np.nan], 'cat2': ['c1','c1','c1','c2',np.nan,np.nan]}
df = pd.DataFrame(data=d)
df
    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   NaN     NaN     NaN
5   NaN     NaN     NaN

Required output:

    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   2019.0  5.0     c2 # cat2 can be left unfilled if the earlier pattern can't be inferred
5   2020.0  6.0     c2 # cat2 can be left unfilled if the earlier pattern can't be inferred

I tried df.interpolate(method='krogh'). It fills cat1 correctly (1, 2, 3, 4, 5, 6) but gets the other columns wrong.

asked Oct 08 '21 11:10

vishvas chauhan

4 Answers

Here is my solution for the specific use case you mention.

The code for the helper functions categorical_repeat, continous_interpolate and other is given below, under EXPLANATION > Approach.

config = {'year':categorical_repeat,    #shortest repeating sequence
          'cat1':continous_interpolate, #curve fitting (linear)
          'cat2':other}                 #forward fill

print(df.agg(config))

     year  cat1 cat2
0  2019.0     1   c1
1  2020.0     2   c1
2  2019.0     3   c1
3  2020.0     4   c2
4  2019.0     5   c2
5  2020.0     6   c2

EXPLANATION:

As far as I know, pandas has no direct way of handling all the pattern types that Excel does. Excel extends continuous sequences linearly, and uses other methods for other column patterns:

Continuous integer array -> linear interpolation
Repeated cycles -> Smallest repeating sequence
Alphabet (and similar) -> Tiling fixed sequence until the length of df
Unrecognizable pattern -> Forward fill
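The configuration further down picks a method for each column by hand. As a rough sketch of how that choice might be automated, the hypothetical helper guess_strategy below (not part of pandas) labels a column as 'linear', 'cycle' or 'ffill'. It leaves out the alphabet case, and the rule that a cycle must repeat at least twice in full is an assumption made for this example, not something Excel documents.

```python
import numpy as np
import pandas as pd


def shortest_period(values):
    """Smallest p such that values[i] == values[i % p] for every i."""
    n = len(values)
    for p in range(1, n + 1):
        if all(values[i] == values[i % p] for i in range(n)):
            return p
    return n


def guess_strategy(s):
    """Hypothetical helper: name a fill strategy for one column."""
    known = s.dropna().tolist()
    if len(known) < 2:
        return "ffill"
    # Numeric column with a constant, non-zero step -> extend the line
    if all(isinstance(v, (int, float)) for v in known):
        steps = np.diff(known)
        if steps[0] != 0 and np.allclose(steps, steps[0]):
            return "linear"
    # Assumption: trust a cycle only if it repeats at least twice in full
    if 2 * shortest_period(known) <= len(known):
        return "cycle"
    return "ffill"


df = pd.DataFrame({
    'year': [2019, 2020, 2019, 2020, np.nan, np.nan],
    'cat1': [1, 2, 3, 4, np.nan, np.nan],
    'cat2': ['c1', 'c1', 'c1', 'c2', np.nan, np.nan],
})
strategies = {name: guess_strategy(df[name]) for name in df.columns}
print(strategies)
```

The returned labels could then be mapped to the helper functions defined in the Approach section.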

Here is the dummy dataset I tried my approach on:

data = {'A': [2019, 2020, 2019, 2020, 2019, 2020],
        'B': [1, 2, 3, 4, 5, 6],
        'C': [6, 5, 4, 3, 2, 1],
        'D': ['C', 'D', 'E', 'F', 'G', 'H'],
        'E': ['A', 'B', 'C', 'A', 'B', 'C'],
        'F': [1,2,3,3,4,2]
       }

df = pd.DataFrame(data)
empty = pd.DataFrame(columns=df.columns, index=df.index)[:4]
df_new = pd.concat([df, empty]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0
print(df_new)

      A    B    C    D    E    F
0  2019    1    6    C    A    1
1  2020    2    5    D    B    2
2  2019    3    4    E    C    3
3  2020    4    3    F    A    3
4  2019    5    2    G    B    4
5  2020    6    1    H    C    2
6   NaN  NaN  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN  NaN

Approach:

Let's start with some helper functions -

import numpy as np
import pandas as pd
from scipy.optimize import curve_fit  #On older SciPy, a bare "import scipy" does not load scipy.optimize


#Curve fitting (linear)
def f(x, m, c):
    return m*x+c     #Modify to extrapolate for exponential sequences etc.

#Interpolate continuous linear
def continous_interpolate(s):
    clean = s.dropna()
    popt, pcov = curve_fit(f, clean.index, clean)   #Fit on the known rows only
    output = [round(i) for i in f(s.index, *popt)]  #Remove the round() for float values
    return pd.Series(output)

#Smallest Repeating sub-sequence
def pattern(inputv):
    '''
    https://stackoverflow.com/questions/6021274/finding-shortest-repeating-cycle-in-word
    '''
    pattern_end = 0
    for j in range(pattern_end + 1, len(inputv)):
        pattern_dex = j % (pattern_end + 1)
        if inputv[pattern_dex] != inputv[j]:
            pattern_end = j
            continue
        if j == len(inputv) - 1:
            return inputv[0:pattern_end + 1]
    return inputv

#Categorical repeat imputation
def categorical_repeat(s):
    clean = s.dropna()
    cycle = pattern(clean)
    
    repetitions = (len(s)//len(cycle))+1
    output = np.tile(cycle, repetitions)[:len(s)]
    return pd.Series(output)

#Continuous sequence of letters
def alphabet(s):
    alp = 'abcdefghijklmnopqrstuvwxyz'
    alp2 = alp*((len(s)//len(alp))+1)
    
    start = s[0]
    idx = alp2.find(start.lower())
    output = alp2[idx:idx+len(s)]

    if start.isupper():
        output = output.upper()
    
    return pd.Series(list(output))

#If no pattern then just ffill
def other(s):
    return s.ffill()

Next, let's create a configuration based on what we want to solve and apply the required methods:

config = {'A':categorical_repeat,
          'B':continous_interpolate, 
          'C':continous_interpolate, 
          'D':alphabet,
          'E':categorical_repeat, 
          'F':other}

output_df = df_new.agg(config)
print(output_df)

      A   B  C  D  E  F
0  2019   1  6  C  A  1
1  2020   2  5  D  B  2
2  2019   3  4  E  C  3
3  2020   4  3  F  A  3
4  2019   5  2  G  B  4
5  2020   6  1  H  C  2
6  2019   7  0  I  A  2
7  2020   8 -1  J  B  2
8  2019   9 -2  K  C  2
9  2020  10 -3  L  A  2
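If SciPy is not available, the linear case can probably be handled with numpy.polyfit instead of curve_fit. This is only a sketch under that assumption; linear_extend is a hypothetical name, and unlike continous_interpolate it keeps the known values and fills only the gaps.

```python
import numpy as np
import pandas as pd


def linear_extend(s):
    """Hypothetical helper: fit a straight line to the known values of s
    and use it to fill only the missing positions."""
    values = s.astype(float)
    pos = np.arange(len(s))                  # row positions 0..n-1
    known = values.notna().to_numpy()
    slope, intercept = np.polyfit(pos[known], values[known].to_numpy(), 1)
    fitted = pd.Series(slope * pos + intercept, index=s.index)
    return values.fillna(fitted.round())     # drop .round() for float sequences


up = linear_extend(pd.Series([1, 2, 3, 4, np.nan, np.nan]))
down = linear_extend(pd.Series([6, 5, 4, 3, 2, 1, np.nan, np.nan]))
print(up.tolist())
print(down.tolist())
```

Because the fit uses row positions rather than index labels, this version does not depend on the index having been reset.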

answered Oct 19 '22 09:10

Akshay Sehgal

I tested some stuff out and did some more research. It appears pandas does not currently offer the functionality you're looking for.

df['cat1'].interpolate(method='linear') will only work if the first/last values are already filled in. You would have to manually assign df.loc[5, 'cat1'] = 6 in this example; a linear interpolation would then work.
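A short check of that behaviour, using the cat1 values from the question. With the default settings, trailing NaNs have no value after them to interpolate towards, so the last known value is simply repeated:

```python
import numpy as np
import pandas as pd

cat1 = pd.Series([1, 2, 3, 4, np.nan, np.nan])

# No end point: the trailing NaNs are filled with the last known value
plain = cat1.interpolate(method='linear')
print(plain.tolist())      # [1.0, 2.0, 3.0, 4.0, 4.0, 4.0]

# End point supplied by hand: the gap is interpolated between 4 and 6
anchored = cat1.copy()
anchored.loc[5] = 6
anchored = anchored.interpolate(method='linear')
print(anchored.tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```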

Some Options:

If the data is small enough, you can always export to Excel and use the fill there, then bring back into pandas.
Analyze the patterns yourself and design your own fill methods. For example, to get the year, you can use df['year'] = df.index.to_series().apply(lambda x: 2019 if x % 2 == 0 else 2020).
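The alternating fill from the second option can also be written without apply. A minimal sketch, assuming the default 0..n-1 index so that even rows are 2019 and odd rows are 2020:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020, 2019, 2020, np.nan, np.nan]})

# Even row labels get 2019, odd row labels get 2020 (relies on the default index)
df['year'] = np.where(df.index % 2 == 0, 2019, 2020)
print(df['year'].tolist())
```

Like the lambda version, this hardcodes the two values, so it is a per-column fix rather than a general fill.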

There are other Stack Overflow questions very similar to this, and none that I saw have a generic answer.

answered Oct 19 '22 10:10

Chuck Tucker

I would do the following:

from pandas.api.types import is_numeric_dtype
from itertools import cycle

def excel_drag(column):
    
    S = column[column.bfill().dropna().index].copy()  # drop last empty values
    numeric = is_numeric_dtype(S)
    groups = S.groupby(S, sort=False).apply(lambda df: len(df.index))
    
    if (len(groups) == len(S)):
        if numeric:
            # Extrapolate
            return column.interpolate(method='krogh')
        else:
            # ffill
            return column.ffill()
        
    elif (groups == groups.iloc[0]).all():  # All equal
        # Repeat sequence
        seq_len = len(groups)
        seq = cycle(S.iloc[:seq_len].values)
        filling = column[column.bfill().isna()].apply(lambda x: next(seq))
        return column.fillna(filling)
    
    else:
        # ffill
        return column.ffill()

With that function, df.apply(excel_drag, axis=0) results in:

     year  cat1 cat2
0  2019.0   1.0   c1
1  2020.0   2.0   c1
2  2019.0   3.0   c1
3  2020.0   4.0   c2
4  2019.0   5.0   c2
5  2020.0   6.0   c2
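To see which branch each column takes, it may help to print the group sizes that excel_drag compares. The sketch below recomputes them for the question's data; it uses a plain dropna() as a simplification, whereas excel_drag drops only the trailing empty rows (the two are equivalent for this DataFrame).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2019, 2020, np.nan, np.nan],
    'cat1': [1, 2, 3, 4, np.nan, np.nan],
    'cat2': ['c1', 'c1', 'c1', 'c2', np.nan, np.nan],
})

group_sizes = {}
for name in df.columns:
    S = df[name].dropna()   # simplification: excel_drag drops only trailing NaNs
    # Number of occurrences of each distinct value, in first-seen order
    group_sizes[name] = S.groupby(S, sort=False).size().tolist()

print(group_sizes)
# {'year': [2, 2], 'cat1': [1, 1, 1, 1], 'cat2': [3, 1]}
```

year has equal group sizes, so the sequence is repeated; cat1 has one group per row, so it is extrapolated; cat2 has unequal sizes, so it falls through to ffill.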

answered Oct 19 '22 09:10

jabellcu

Below is my answer for the year column. I understand that cat1 is already handled and cat2 can be ignored. One assumption I've made, based on the question, is that the repeat pattern is consistent. If it is not, factorize may not work.

The idea is to use factorize to extract the repeat pattern, then build a list from it. Excel's drag fill is a cycle of the repeat pattern, so it's natural to use itertools cycle. (PS: this was first done in @jabellcu's answer, so I don't want to take credit for it. If you think factorize + cycle is good, please check his answer.)

An advantage is that the code is generic. You don't have to hardcode the values. You can turn it into a function and call it for whatever values there are in the DataFrame.

import numpy as np
import pandas as pd
from itertools import cycle
d = { 'year': [2019,2020,2019,2020,np.nan,np.nan], 'cat1': [1,2,3,4,np.nan,np.nan], 'cat2': ['c1','c1','c1','c2',np.nan,np.nan]}
df = pd.DataFrame(data=d)
df

Enhanced Answer:

l = df['year'].factorize()[1].to_list()
c = cycle(l)
df['year'] = [next(c) for i in range(len(df))]

df['cat1'] = df['cat1'].interpolate(method='krogh')
df['cat2'] = df['cat2'].ffill()  # fillna(method='ffill') is deprecated in recent pandas
df

PS: I've left a comment on the question asking how cat2 should be handled. For the time being, I assume ffill.

From my reading of the question, I assume you don't need detection logic, so I won't provide it. I only provide the conversion logic.
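Wrapped as a function, the factorize + cycle step might look like the sketch below. fill_by_cycle is a hypothetical name; like the snippet above, it rewrites the whole column, and it assumes the column starts at the beginning of a cycle and that each value appears only once per cycle.

```python
from itertools import cycle, islice

import numpy as np
import pandas as pd


def fill_by_cycle(s):
    """Hypothetical helper: repeat the distinct values of s, in first-seen
    order, over the full length of the column."""
    uniques = s.factorize()[1].tolist()   # factorize leaves NaN out of the uniques
    return pd.Series(list(islice(cycle(uniques), len(s))), index=s.index)


year = pd.Series([2019, 2020, 2019, 2020, np.nan, np.nan])
filled = fill_by_cycle(year)
print(filled.tolist())
```

A column such as A, B, A, C would break the second assumption, because factorize would reduce it to the cycle A, B, C.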

answered Oct 19 '22 08:10

EBDS

Tags: python, pandas, dataframe, numpy