 

Apply a function to each cell of a pandas dataframe using information from a particular column

I want to expand the list entries of a dataframe using the information in column i:

i   s_1         s_1        s_3
2   [1, 2, 3]   [3, 4, 5]  NaN
1   NaN         [0, 0, 0]  [2]

The i value just indicates how often the last value of each list should be copied:

i   s_1               s_1              s_3
2   [1, 2, 3, 3, 3]   [3, 4, 5, 5, 5]  NaN
1   NaN               [0, 0, 0, 0]     [2, 2]
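In plain Python, the transformation on a single cell is just list concatenation and repetition (the helper name `pad_list` is mine, for illustration):

```python
def pad_list(lst, i):
    # append i extra copies of the last element
    return lst + [lst[-1]] * i

print(pad_list([1, 2, 3], 2))  # [1, 2, 3, 3, 3]
print(pad_list([0, 0, 0], 1))  # [0, 0, 0, 0]
```

The question is how to apply this efficiently to every list cell of the frame while leaving the NaN cells alone.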

I am currently using a nested apply loop:

test.apply(lambda x: x.apply(
     lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y)==list else 0), axis=1)

However, this is very slow, and with many rows (>10,000) the code breaks. This solution seems a bit messy, and I'm wondering what the best approach would be for something like this?

Asked Apr 23 '21 by J-H




3 Answers

You can try to extend the lists in place:

for col in df.loc[:, "s_1":]:
    m = df[col].notna()

    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)

    df.loc[~m, col] = 0

Benchmark:

from timeit import timeit
from ast import literal_eval
from io import StringIO

import numpy as np
import pandas as pd


def get_df():
    dfs = []

    # create some big dataframe
    for i in range(5000):
        txt = """
        i   s_1         s_1        s_3
        2   [1, 2, 3]   [3, 4, 5]  NaN
        1   NaN         [0, 0, 0]  [2]  """

        df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", engine="python")

        df.loc[:, "s_1":] = df.loc[:, "s_1":].apply(
            lambda x: [v if pd.isna(v) else literal_eval(v) for v in x]
        )
        dfs.append(df)

    return pd.concat(dfs)


def f1(df):
    for col in df.loc[:, "s_1":]:
        m = df[col].notna()

        for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
            v.extend([v[-1]] * i)

        df.loc[~m, col] = 0
    return df


def f2(df):
    df = df.apply(
        lambda x: x.apply(
            lambda y: np.pad(y, (0, x.i), "constant", constant_values=y[-1])
            if type(y) == list
            else 0
        ),
        axis=1,
    )
    return df


df1 = get_df()
df2 = get_df()

t1 = timeit(lambda: f1(df1), number=1)
t2 = timeit(lambda: f2(df2), number=1)

print(t1)
print(t2)

Prints:

0.01171580795198679
2.3192087680799887

So the improvement is roughly 200x.
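For reference, here is a minimal self-contained run of this approach on the sample data (the duplicate `s_1` column is renamed to `s_2` here so the frame can be built from a plain dict):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "i": [2, 1],
    "s_1": [[1, 2, 3], np.nan],
    "s_2": [[3, 4, 5], [0, 0, 0]],
    "s_3": [np.nan, [2]],
})

for col in df.loc[:, "s_1":]:
    m = df[col].notna()
    # extend each list object in place -- no per-cell apply() or new arrays
    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)
    df.loc[~m, col] = 0

print(df)
```

The speed comes from mutating the existing list objects instead of building a new array per cell, and from touching only the non-NaN cells of each column.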

Answered Oct 25 '22 by Andrej Kesely


We can stack the dataframe and use a list comprehension to pad the values, then unstack back to restore the original shape:

s = df.set_index('i', append=True).stack()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
out = s.unstack()

   i              s_1            s_1.1     s_3
0  2  [1, 2, 3, 3, 3]  [3, 4, 5, 5, 5]     NaN
1  1              NaN     [0, 0, 0, 0]  [2, 2]
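This works because `stack()` drops the NaN cells, so the list comprehension only ever sees real lists, and `unstack()` reinserts NaN for the missing cells afterwards. A minimal sketch with the duplicate column dropped for clarity (the explicit `dropna()` is my addition, to keep the NaN-dropping behaviour regardless of pandas version):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"i": [2, 1],
                   "s_1": [[1, 2, 3], np.nan],
                   "s_3": [np.nan, [2]]})

# move i into the index, flatten to one Series entry per cell, drop NaN cells
s = df.set_index("i", append=True).stack().dropna()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
out = s.unstack()  # NaN comes back for the cells that were dropped
print(out)
```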

Performance checks

# Prepare a sample dataframe with 10,000 rows
df = pd.concat([df] * 5000, ignore_index=True)


# Solution with stack and unstack
%%timeit
s = df.set_index('i', append=True).stack()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
_ = s.unstack()
# 38.9 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


# OP's solution with apply and np.pad
%%timeit 
df.apply(lambda x: x.apply(
    lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y) == list else 0), axis=1)
# 7.92 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered Oct 25 '22 by Shubham Sharma


To avoid iterating over rows, you might try the following:

df = pd.DataFrame(
    data=[[2, [1, 2, 3], [3, 4, 5], None],
          [1, None, [0, 0, 0], [2]]],
    columns=['i', 's_1', 's_2', 's_3'])

for col in ['s_1', 's_2', 's_3']:
    # concatenate i copies of the last element onto each list
    df[col] = df[col] + df['i'] * df[col].apply(lambda x: [x[-1]] if isinstance(x, list) else x)

Output

i  s_1              s_2              s_3
2  [1, 2, 3, 3, 3]  [3, 4, 5, 5, 5]  nan
1  nan              [0, 0, 0, 0]     [2, 2]
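This works because arithmetic on object-dtype Series falls back to the elementwise Python operators, so `int * [x]` repeats the one-element list and `list + list` concatenates. In isolation (a sketch of a single row):

```python
import pandas as pd

lists = pd.Series([[1, 2, 3]], dtype=object)   # the original lists
tails = pd.Series([[3]], dtype=object)         # [last element] per row
reps = pd.Series([2])                          # the i column

padded = lists + reps * tails  # [1, 2, 3] + 2 * [3]
print(padded)
```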
Answered Oct 25 '22 by Sebastien D