I already have the following code, which does work correctly. However, profiling has indicated that this code is one of the important bottlenecks in my code, so I'd like to optimize it if possible, and I also have reason to believe that should be possible:
df["NewColumn1"] = df.apply(lambda row: compute_new_column1_value(row), axis=1)
df["NewColumn2"] = df.apply(lambda row: compute_new_column2_value(row), axis=1)
# a few more lines of code like the above
I based this solution on answers to questions like this one (a similar question, but specifically about adding one new column, whereas my question is about adding many). I suppose that each of these df.apply() calls is internally implemented with a loop through all the rows, and I suspect it should be possible to optimize this with a solution that loops through all the rows only once (as opposed to once per column I want to add).
In other answers, I have seen references to the assign() function, which does indeed support adding multiple columns at once. I tried using this in the following way:
# WARNING: this does NOT work
df = df.assign(
NewColumn1=lambda row: compute_new_column1_value(row),
NewColumn2=lambda row: compute_new_column2_value(row),
# more lines like the two above
)
The reason this doesn't work is that the lambdas don't receive rows of the dataframe as arguments at all; they get the entire dataframe at once, and each lambda is then expected to return a complete column/Series/array of values. So my problem here is that I'd end up implementing manual loops through all the rows myself inside those lambdas, which is obviously going to be even worse for performance.
I can think of two solutions conceptually, but have been unable to find how to actually implement them so far:

1. Something like df.assign() (which supports adding multiple columns at once), but with the ability to pass rows into the lambdas instead of the complete dataframe.
2. A way to vectorize my compute_new_columnX_value() functions, so that they can be used as lambdas in the way that df.assign() expects.
My problem with the second solution so far is that the row-based versions of some of my functions look as follows, and I have difficulty finding how to properly vectorize them:
def compute_new_column1_value(row):
    if row["SomeExistingColumn"] in some_dictionary:
        return some_dictionary[row["SomeExistingColumn"]]
    else:
        return some_default_value
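For this particular dictionary-lookup pattern, the vectorized form the second solution asks for does exist: Series.map performs the lookup for a whole column at once, and fillna supplies the default for keys that are missing from the dictionary. A minimal sketch, with stand-in values for some_dictionary and some_default_value:

```python
import pandas as pd

# stand-ins for the real objects from the question
some_dictionary = {"a": 1, "b": 2}
some_default_value = 0

df = pd.DataFrame({"SomeExistingColumn": ["a", "b", "c"]})

# map() does the dict lookup for the entire column in one vectorized pass;
# keys not found in the dict become NaN, which fillna() replaces
df["NewColumn1"] = df["SomeExistingColumn"].map(some_dictionary).fillna(some_default_value)
```

This replaces the per-row if/else with a single column-level operation, which is exactly the shape df.assign() expects.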
Have you tried initializing the columns as nan, iterating through the dataframe by row, and assigning the values with loc?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, (10, 5)))
print(df)

df[5] = np.nan
df[6] = np.nan

for i, row in df.iterrows():
    df.loc[i, 5] = row[1] + row[4]
    df.loc[i, 6] = row[3] * 2

print(df)
yields
0 1 2 3 4
0 17 4 3 11 10
1 16 1 14 11 16
2 4 18 12 19 7
3 11 3 7 10 5
4 11 0 10 1 17
5 5 17 10 3 8
6 0 0 7 3 6
7 7 18 18 13 8
8 16 4 12 11 16
9 13 9 15 8 19
0 1 2 3 4 5 6
0 17 4 3 11 10 14.0 22.0
1 16 1 14 11 16 17.0 22.0
2 4 18 12 19 7 25.0 38.0
3 11 3 7 10 5 8.0 20.0
4 11 0 10 1 17 17.0 2.0
5 5 17 10 3 8 25.0 6.0
6 0 0 7 3 6 6.0 6.0
7 7 18 18 13 8 26.0 26.0
8 16 4 12 11 16 20.0 22.0
9 13 9 15 8 19 28.0 16.0
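For what it's worth, the two derived columns in this particular example don't need a row loop at all; whole-column arithmetic produces the same result in a single vectorized step per column. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, (10, 5)))

# whole-column arithmetic: pandas aligns the Series and computes
# every row at once, with no explicit Python loop
df[5] = df[1] + df[4]
df[6] = df[3] * 2
```

For simple arithmetic like this, the vectorized form is typically orders of magnitude faster than iterrows() with loc assignments.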
I am really taken by this question so here is another example involving external dictionaries:
import pandas as pd
import numpy as np
# Create data frame and external dictionaries
rnd = pd.Series(np.random.randint(10, 100, 10))
names = 'Rafael Roger Grigor Alexander Dominic Marin David Jack Stan Pablo'
name = names.split(' ')
surnames = 'Nadal Federer Dimitrov Zverev Thiem Cilic Goffin Sock Wawrinka Busta'
surname = surnames.split()
countries_str = ('Spain Switzerland Bulgaria Germany Austria Croatia Belgium USA Switzerland Spain')
country = countries_str.split(' ')
player = dict(zip(name, surname))
player_country = dict(zip(name, country))
df = pd.DataFrame(
{'A': rnd, 'B': 100, 'Name': name, 'Points': np.nan, 'Surname': np.nan, 'Country': np.nan})
df = df[['A', 'B', 'Name', 'Surname', 'Country', 'Points']]
df.loc[9, 'Name'] = 'Dennis'
print(df)
# Functions to fill the empty columns
def f1():
    return df['A'].mul(df['B'])

def f2():
    return np.random.randint(1, 10)

def f3():
    # relies on the loop variable `key` from the enclosing scope
    return player[key]

def f4():
    # relies on the loop variable `key` from the enclosing scope
    return player_country[key]

def f5():
    return 'Unknown'

def f6():
    return 0
# .assign() dependent on a boolean mask
for key, value in player.items():
    df[df['Name'] == key] = df[df['Name'] == key].assign(
        Surname=f3(), Country=f4(), Points=f1())

df[df['Name'] == 'Dennis'] = df[df['Name'] == 'Dennis'].assign(
    Surname=f5(), Country=f5(), Points=f6())
df = df.sort_values('Points', ascending=False)
print(df)
A B Name Surname Country Points
1 97.0 100.0 Roger Federer Switzerland 9700.0
4 93.0 100.0 Dominic Thiem Austria 9300.0
8 92.0 100.0 Stan Wawrinka Switzerland 9200.0
5 86.0 100.0 Marin Cilic Croatia 8600.0
6 67.0 100.0 David Goffin Belgium 6700.0
7 61.0 100.0 Jack Sock USA 6100.0
0 35.0 100.0 Rafael Nadal Spain 3500.0
2 34.0 100.0 Grigor Dimitrov Bulgaria 3400.0
3 25.0 100.0 Alexander Zverev Germany 2500.0
9 48.0 100.0 Dennis Unknown Unknown 0.0
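The same dictionary lookups can alternatively be vectorized with Series.map and np.where, avoiding the per-key loop entirely. A sketch using a trimmed-down version of the data above (just three players, as an illustration):

```python
import numpy as np
import pandas as pd

player = {'Rafael': 'Nadal', 'Roger': 'Federer'}
player_country = {'Rafael': 'Spain', 'Roger': 'Switzerland'}

df = pd.DataFrame({'A': [35, 97, 48], 'B': 100,
                   'Name': ['Rafael', 'Roger', 'Dennis']})

# map() does each dict lookup column-wide; unknown names become NaN,
# which fillna() turns into the 'Unknown' default
df['Surname'] = df['Name'].map(player).fillna('Unknown')
df['Country'] = df['Name'].map(player_country).fillna('Unknown')

# np.where picks A*B for known players and 0 for everyone else
df['Points'] = np.where(df['Name'].isin(player.keys()), df['A'] * df['B'], 0)
```

This touches each column once instead of filtering the dataframe once per dictionary key.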
The answers provided so far do not provide a speedup for my specific case, for reasons I provided in the comments. The best solution I've been able to find so far is primarily based on this answer to another question. It didn't provide me a large speedup (about 10%), but it's the best I've been able to do so far. I'd still be very much interested in faster solutions if they exist!
It turns out that, like the assign function, apply can in fact also be given a function that returns a Series of values for multiple columns at once, instead of only a single scalar. So the fastest implementation I have so far looks as follows:
# first initialize all the new columns with standard values for the entire df at once;
# this turns out to be very important. Skipping it comes at a high computational cost.
for new_column in ["NewColumn1", "NewColumn2", "etc."]:
    df[new_column] = np.nan

df = df.apply(compute_all_new_columns, axis=1)
And then, instead of having all those separate lambdas for the different new columns, they're all implemented in the same function like this:
def compute_all_new_columns(row):
    if row["SomeExistingColumn"] in some_dictionary:
        row["NewColumn1"] = some_dictionary[row["SomeExistingColumn"]]
    else:
        row["NewColumn1"] = some_default_value

    if some_other_condition:
        row["NewColumn2"] = whatever
    else:
        row["NewColumn2"] = row["SomeExistingColumn"] * whatever

    # assign values to other new columns here

    # apply() needs the modified row back to build the result dataframe
    return row
The resulting dataframe contains all the columns it previously did, plus values for all the new columns as filled in on a row-by-row basis by the compute_all_new_columns function. The original ordering is preserved. This solution contains no explicit Python loops, and only a single pass through the rows "behind the scenes" as provided by the pandas apply function.
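Putting the pieces together, a self-contained sketch of this approach might look like the following (the dictionary, default value, and the second column's logic are hypothetical stand-ins for the real ones from the question):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for the question's real objects
some_dictionary = {"a": 10, "b": 20}
some_default_value = -1

df = pd.DataFrame({"SomeExistingColumn": ["a", "b", "c"], "X": [1, 2, 3]})

# initialize all new columns up front for the whole dataframe
for new_column in ["NewColumn1", "NewColumn2"]:
    df[new_column] = np.nan

def compute_all_new_columns(row):
    # dict.get() is equivalent to the if/else membership check above
    row["NewColumn1"] = some_dictionary.get(row["SomeExistingColumn"],
                                            some_default_value)
    # stand-in logic for the second column
    row["NewColumn2"] = row["X"] * 2
    return row

# one pass over the rows fills every new column at once
df = df.apply(compute_all_new_columns, axis=1)
```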