 

Apply a function to each cell of a pandas dataframe using information from a particular column

I want to expand the list entries of a dataframe using the information in column i:

i   s_1         s_1        s_3
2   [1, 2, 3]   [3, 4, 5]  NaN
1   NaN         [0, 0, 0]  [2]

The i value just indicates how often the last value of each list should be copied:

i   s_1               s_1              s_3
2   [1, 2, 3, 3, 3]   [3, 4, 5, 5, 5]  NaN
1   NaN               [0, 0, 0, 0]     [2, 2]
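In plain Python, the transformation on a single cell is just list concatenation and repetition (the helper name `pad_list` is mine, for illustration):

```python
def pad_list(lst, i):
    # append i extra copies of the last element
    return lst + [lst[-1]] * i

print(pad_list([1, 2, 3], 2))  # [1, 2, 3, 3, 3]
print(pad_list([0, 0, 0], 1))  # [0, 0, 0, 0]
```

The question is how to apply this efficiently to every list cell of the frame while leaving the NaN cells alone.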

I am currently using a nested apply loop:

test.apply(lambda x: x.apply(
     lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y)==list else 0), axis=1)

However, this is very slow, and with many rows (>10,000) the code breaks. This solution seems a bit messy, and I'm wondering what the best approach would be for something like this?

Asked Apr 23 '21 by J-H




3 Answers

You can try to extend the lists in place:

for col in df.loc[:, "s_1":]:
    m = df[col].notna()

    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)

    df.loc[~m, col] = 0

Benchmark:

from timeit import timeit
from ast import literal_eval
from io import StringIO

import numpy as np
import pandas as pd


def get_df():
    dfs = []

    # create some big dataframe
    for i in range(5000):
        txt = """
        i   s_1         s_1        s_3
        2   [1, 2, 3]   [3, 4, 5]  NaN
        1   NaN         [0, 0, 0]  [2]  """

        df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", engine="python")

        df.loc[:, "s_1":] = df.loc[:, "s_1":].apply(
            lambda x: [v if pd.isna(v) else literal_eval(v) for v in x]
        )
        dfs.append(df)

    return pd.concat(dfs)


def f1(df):
    for col in df.loc[:, "s_1":]:
        m = df[col].notna()

        for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
            v.extend([v[-1]] * i)

        df.loc[~m, col] = 0
    return df


def f2(df):
    df = df.apply(
        lambda x: x.apply(
            lambda y: np.pad(y, (0, x.i), "constant", constant_values=y[-1])
            if type(y) == list
            else 0
        ),
        axis=1,
    )
    return df


df1 = get_df()
df2 = get_df()

t1 = timeit(lambda: f1(df1), number=1)
t2 = timeit(lambda: f2(df2), number=1)

print(t1)
print(t2)

Prints:

0.01171580795198679
2.3192087680799887

So the improvement is roughly 200x.
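For reference, here is a minimal self-contained run of this approach on the sample data (the duplicate `s_1` column is renamed to `s_2` here so the frame can be built from a plain dict):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "i": [2, 1],
    "s_1": [[1, 2, 3], np.nan],
    "s_2": [[3, 4, 5], [0, 0, 0]],
    "s_3": [np.nan, [2]],
})

for col in df.loc[:, "s_1":]:
    m = df[col].notna()
    # extend each list object in place -- no per-cell apply() or new arrays
    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)
    df.loc[~m, col] = 0

print(df)
```

The speed comes from mutating the existing list objects instead of building a new array per cell, and from touching only the non-NaN cells of each column.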

Answered Oct 25 '22 by Andrej Kesely


We can stack the dataframe and use a list comprehension to pad the values, then unstack back to restore the original shape:

s = df.set_index('i', append=True).stack()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
out = s.unstack()

   i              s_1            s_1.1     s_3
0  2  [1, 2, 3, 3, 3]  [3, 4, 5, 5, 5]     NaN
1  1              NaN     [0, 0, 0, 0]  [2, 2]
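This works because `stack()` drops the NaN cells, so the list comprehension only ever sees real lists, and `unstack()` reinserts NaN for the missing cells afterwards. A minimal sketch with the duplicate column dropped for clarity (the explicit `dropna()` is my addition, to keep the NaN-dropping behaviour regardless of pandas version):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"i": [2, 1],
                   "s_1": [[1, 2, 3], np.nan],
                   "s_3": [np.nan, [2]]})

# move i into the index, flatten to one Series entry per cell, drop NaN cells
s = df.set_index("i", append=True).stack().dropna()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
out = s.unstack()  # NaN comes back for the cells that were dropped
print(out)
```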

Performance checks

# Prepare a sample dataframe with 10,000 rows
df = pd.concat([df] * 5000, ignore_index=True)


# Solution with stack and unstack
%%timeit
s = df.set_index('i', append=True).stack()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
_ = s.unstack()
# 38.9 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# OP's solution with apply and np.pad
%%timeit 
df.apply(lambda x: x.apply(
    lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y) == list else 0), axis=1)
# 7.92 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered Oct 25 '22 by Shubham Sharma


To avoid iterating over rows, you might try the following:

df = pd.DataFrame(
    data=[[2, [1, 2, 3], [3, 4, 5], None],
          [1, None, [0, 0, 0], [2]]],
    columns=['i', 's_1', 's_2', 's_3'])

for col in ['s_1', 's_2', 's_3']:
    # concatenate i copies of the last element onto each list
    df[col] = df[col] + df['i'] * df[col].apply(lambda x: [x[-1]] if isinstance(x, list) else x)

Output

i  s_1              s_2              s_3
2  [1, 2, 3, 3, 3]  [3, 4, 5, 5, 5]  nan
1  nan              [0, 0, 0, 0]     [2, 2]
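This works because arithmetic on object-dtype Series falls back to the elementwise Python operators, so `int * [x]` repeats the one-element list and `list + list` concatenates. In isolation (a sketch of a single row):

```python
import pandas as pd

lists = pd.Series([[1, 2, 3]], dtype=object)   # the original lists
tails = pd.Series([[3]], dtype=object)         # [last element] per row
reps = pd.Series([2])                          # the i column

padded = lists + reps * tails  # [1, 2, 3] + 2 * [3]
print(padded)
```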
Answered Oct 25 '22 by Sebastien D