I want to expand the list entries of a dataframe using the information in column i:
i  s_1        s_1        s_3
2  [1, 2, 3]  [3, 4, 5]  NaN
1  NaN        [0, 0, 0]  [2]
The i value just indicates how often the last value of each list should be copied:
i  s_1              s_1              s_3
2  [1, 2, 3, 3, 3]  [3, 4, 5, 5, 5]  NaN
1  NaN              [0, 0, 0, 0]     [2, 2]
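To make the rule concrete, here is a minimal sketch on a plain Python list. The helper name `pad_last` is hypothetical and only serves as an illustration; it is not part of the original code.

```python
def pad_last(values, n):
    """Return a new list with the last element of `values` repeated n more times."""
    # values[-1:] is a one-element list (or an empty list if `values` is empty),
    # so the expression never raises IndexError.
    return values + values[-1:] * n


print(pad_last([1, 2, 3], 2))  # [1, 2, 3, 3, 3]
print(pad_last([2], 1))        # [2, 2]
```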
I am currently using a nested apply loop:
test.apply(lambda x: x.apply(
    lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y) == list else 0), axis=1)
However, this is very slow, and if I have a lot of rows (>10,000) the code breaks. The solution also seems a bit messy, and I'm wondering what the best approach would be for something like this.
You can try to extend the lists inplace:
for col in df.loc[:, "s_1":]:
    m = df[col].notna()
    for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
        v.extend([v[-1]] * i)
    df.loc[~m, col] = 0
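Note that `extend` modifies the list objects stored in the dataframe. If the original lists have to stay untouched, a variant that builds new lists might look like the following sketch. The helper `pad_column`, the sample frame, and the column names `s_1`, `s_2`, `s_3` are assumptions made for illustration.

```python
import pandas as pd

# Small sample frame; the column names are assumed for this sketch.
df = pd.DataFrame({
    "i": [2, 1],
    "s_1": [[1, 2, 3], None],
    "s_2": [[3, 4, 5], [0, 0, 0]],
    "s_3": [None, [2]],
})


def pad_column(values, counts):
    # Build new lists instead of extending in place; cells that are not
    # lists (None/NaN) become 0, as in the loop above.
    return [
        v + v[-1:] * int(n) if isinstance(v, list) else 0
        for v, n in zip(values, counts)
    ]


out = df.copy()
for col in ["s_1", "s_2", "s_3"]:
    # Wrap the result in an object Series so pandas stores the lists as-is.
    out[col] = pd.Series(pad_column(df[col], df["i"]), index=df.index, dtype=object)

print(out)
```

This trades the speed of in-place extension for a result that leaves `df` unchanged, which can matter when the same list objects are referenced elsewhere.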
Benchmark:
from ast import literal_eval
from io import StringIO
from timeit import timeit

import numpy as np
import pandas as pd
def get_df():
    dfs = []
    # create some big dataframe
    for i in range(5000):
        # columns are separated by two or more spaces to match sep below
        txt = """
i  s_1        s_1        s_3
2  [1, 2, 3]  [3, 4, 5]  NaN
1  NaN        [0, 0, 0]  [2]"""
        df = pd.read_csv(StringIO(txt), sep=r"\s{2,}", engine="python")
        # turn the string cells back into real Python lists
        df.loc[:, "s_1":] = df.loc[:, "s_1":].apply(
            lambda x: [v if pd.isna(v) else literal_eval(v) for v in x]
        )
        dfs.append(df)
    return pd.concat(dfs)
def f1(df):
    for col in df.loc[:, "s_1":]:
        m = df[col].notna()
        for i, v in zip(df.loc[m, "i"], df.loc[m, col]):
            v.extend([v[-1]] * i)
        df.loc[~m, col] = 0
    return df
def f2(df):
    df = df.apply(
        lambda x: x.apply(
            lambda y: np.pad(y, (0, x.i), "constant", constant_values=y[-1])
            if type(y) == list
            else 0
        ),
        axis=1,
    )
    return df
df1 = get_df()
df2 = get_df()
t1 = timeit(lambda: f1(df1), number=1)
t2 = timeit(lambda: f2(df2), number=1)
print(t1)
print(t2)
Prints:
0.01171580795198679
2.3192087680799887
So the improvement is roughly 200x.
We can stack the dataframe, use a list comprehension to pad the values, and then unstack to restore the original shape:
s = df.set_index('i', append=True).stack()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
out = s.unstack()
i s_1 s_1.1 s_3
0 2 [1, 2, 3, 3, 3] [3, 4, 5, 5, 5] NaN
1 1 NaN [0, 0, 0, 0] [2, 2]
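The list comprehension above assumes that every value left in the stacked Series is a list, i.e. that the missing cells are gone after `stack`. Whether `stack` drops missing cells by default may depend on the pandas version, so the following self-contained sketch calls `dropna()` explicitly to make that assumption visible. The sample frame and the column name `s_2` are assumptions for illustration.

```python
import pandas as pd

# Small sample frame; the column names are assumed for this sketch.
df = pd.DataFrame({
    "i": [2, 1],
    "s_1": [[1, 2, 3], None],
    "s_2": [[3, 4, 5], [0, 0, 0]],
    "s_3": [None, [2]],
})

# Move "i" into the index, reshape to long form, and drop missing cells explicitly.
s = df.set_index("i", append=True).stack().dropna()

# The "i" index level holds the repeat count for each remaining list.
counts = s.index.get_level_values("i")
padded = pd.Series(
    [v + v[-1:] * int(n) for v, n in zip(s, counts)],
    index=s.index,
    dtype=object,
)

# Cells that were dropped above come back as NaN after unstacking.
out = padded.unstack()
print(out)
```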
# Prepare a sample dataframe with 10,000 rows
df = pd.concat([df] * 5000, ignore_index=True)
# Solution with stack and unstack
%%timeit
s = df.set_index('i', append=True).stack()
s[:] = [v + v[-1:] * r for r, v in zip(s.index.get_level_values(1), s)]
_ = s.unstack()
# 38.9 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# OP's solution with apply and np.pad
%%timeit
df.apply(lambda x: x.apply(
    lambda y: np.pad(y, (0, x.i), 'constant', constant_values=y[-1]) if type(y) == list else 0), axis=1)
# 7.92 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To avoid iterating over rows, you might try the following:
import pandas as pd
df = pd.DataFrame(
    data=[[2, [1, 2, 3], [3, 4, 5], None], [1, None, [0, 0, 0], [2]]],
    columns=['i', 's_1', 's_2', 's_3'])
for col in ['s_1', 's_2', 's_3']:
    df[col] = df[col] + df['i'] * df[col].apply(lambda x: [x[-1]] if type(x) == list else x)
Output
i | s_1 | s_2 | s_3 |
---|---|---|---|
2 | [1, 2, 3, 3, 3] | [3, 4, 5, 5, 5] | nan |
1 | nan | [0, 0, 0, 0] | [2, 2] |