
How to efficiently add multiple columns to pandas dataframe with values that depend on other columns

What I have:

  • A dataframe with many rows, and several existing columns (python, pandas).
  • Python 3.6, so a solution that relies on that particular version is fine with me (but obviously solutions that also work for earlier versions are fine too)

What I want to do:

  • Add multiple additional columns to the dataframe, where the values in these new columns all depend in some way on values in existing columns in the same row.
  • The original order of the dataframe must be preserved. If a solution changes the ordering, I could restore it afterwards by manually sorting based on one of the existing columns, but obviously this introduces extra overhead.

I already have the following code, which does work correctly. However, profiling has indicated that this code is one of the important bottlenecks in my code, so I'd like to optimize it if possible, and I also have reason to believe that should be possible:

df["NewColumn1"] = df.apply(lambda row: compute_new_column1_value(row), axis=1)
df["NewColumn2"] = df.apply(lambda row: compute_new_column2_value(row), axis=1)

# a few more lines of code like the above
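(As a side note, the wrapping lambda in the code above is unnecessary; apply can take the function directly. A minimal sketch with a placeholder column and function, just to illustrate the call shape:)

```python
import pandas as pd

df = pd.DataFrame({"SomeExistingColumn": [1, 2, 3]})

def compute_new_column1_value(row):
    # placeholder logic standing in for the real computation
    return row["SomeExistingColumn"] + 1

# apply() accepts the function itself; no lambda wrapper needed
df["NewColumn1"] = df.apply(compute_new_column1_value, axis=1)
print(df["NewColumn1"].tolist())  # [2, 3, 4]
```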

I based this solution on answers to questions like this one (which is similar to mine, but specifically about adding one new column, whereas my question is about adding many new columns). I suppose that each of these df.apply() calls is internally implemented with a loop through all the rows, and I suspect it should be possible to optimize this with a solution that only loops through all the rows once (as opposed to once per column I want to add).

In other answers, I have seen references to the assign() function, which does indeed support adding multiple columns at once. I tried using this in the following way:

# WARNING: this does NOT work
df = df.assign(
    NewColumn1=lambda row: compute_new_column1_value(row),
    NewColumn2=lambda row: compute_new_column2_value(row),
    # more lines like the two above
)

The reason why this doesn't work is that the lambdas don't actually receive rows of the dataframe as arguments at all; they simply get the entire dataframe at once, and each lambda is expected to return a complete column/Series/array of values. So, my problem here is that I'd end up having to implement manual loops through all the rows myself inside those lambdas, which is obviously going to be even worse for performance.
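For reference, a minimal sketch of how assign actually expects its callables to be used: each one receives the whole dataframe and returns a full column at once (the column names here are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# assign() passes the entire dataframe to each callable,
# which must return a complete column of values
df = df.assign(
    doubled=lambda d: d["x"] * 2,
    shifted=lambda d: d["x"] + 10,
)
print(df["doubled"].tolist())  # [2, 4, 6]
```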

I can think of two solutions conceptually, but have been unable to find how to actually implement them so far:

  1. Something like df.assign() (which supports adding multiple columns at once), but with the ability to pass rows into the lambdas instead of the complete dataframe

  2. A way to vectorize my compute_new_columnX_value() functions, so that they can be used in the way that df.assign() expects them to be used.

My problem with the second solution so far is that the row-based versions of some of my functions look as follows, and I have difficulty finding how to properly vectorize them:

def compute_new_column1_value(row):
    if row["SomeExistingColumn"] in some_dictionary:
        return some_dictionary[row["SomeExistingColumn"]]
    else:
        return some_default_value
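A dictionary lookup with a default, like the function above, can in fact be vectorized with Series.map followed by fillna. A sketch, using the placeholder names from the question:

```python
import pandas as pd

some_dictionary = {"a": 1, "b": 2}
some_default_value = 0

df = pd.DataFrame({"SomeExistingColumn": ["a", "b", "c"]})

# map() looks every value up in the dictionary (missing keys become NaN),
# then fillna() substitutes the default for the misses
df["NewColumn1"] = df["SomeExistingColumn"].map(some_dictionary).fillna(some_default_value)
print(df["NewColumn1"].tolist())
```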
asked Jan 19 '18 by Dennis Soemers

3 Answers

Have you tried initializing the columns as nan, iterating through the dataframe by row, and assigning the values with loc?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, (10, 5)))

df[5] = np.nan
df[6] = np.nan

for i, row in df.iterrows():
    df.loc[i, 5] = row[1] + row[4]
    df.loc[i, 6] = row[3] * 2

print(df)

yields

    0   1   2   3   4
0  17   4   3  11  10
1  16   1  14  11  16
2   4  18  12  19   7
3  11   3   7  10   5
4  11   0  10   1  17
5   5  17  10   3   8
6   0   0   7   3   6
7   7  18  18  13   8
8  16   4  12  11  16
9  13   9  15   8  19

    0   1   2   3   4     5     6
0  17   4   3  11  10  14.0  22.0
1  16   1  14  11  16  17.0  22.0
2   4  18  12  19   7  25.0  38.0
3  11   3   7  10   5   8.0  20.0
4  11   0  10   1  17  17.0   2.0
5   5  17  10   3   8  25.0   6.0
6   0   0   7   3   6   6.0   6.0
7   7  18  18  13   8  26.0  26.0
8  16   4  12  11  16  20.0  22.0
9  13   9  15   8  19  28.0  16.0
answered Oct 19 '22 by Eric Ed Lohmar


I am really taken by this question, so here is another example involving external dictionaries:

import pandas as pd
import numpy as np

# Create data frame and external dictionaries
rnd = pd.Series(np.random.randint(10, 100, 10))

names = 'Rafael Roger Grigor Alexander Dominic Marin David Jack Stan Pablo'
name = names.split(' ')

surnames = 'Nadal Federer Dimitrov Zverev Thiem Cilic Goffin Sock Wawrinka Busta'
surname = surnames.split()

countries_str = ('Spain Switzerland Bulgaria Germany Austria Croatia Belgium USA Switzerland Spain')
country = countries_str.split(' ')

player = dict(zip(name, surname))
player_country = dict(zip(name, country))

df = pd.DataFrame(
        {'A': rnd, 'B': 100, 'Name': name, 'Points': np.nan, 'Surname': np.nan, 'Country': np.nan})

df = df[['A', 'B', 'Name', 'Surname', 'Country', 'Points']]
df.loc[9, 'Name'] = 'Dennis'

print(df)

# Functions to fill the empty columns
def f1():
    return df['A'].mul(df['B'])

def f2():
    return np.random.randint(1, 10)

def f3():
    # note: reads the variable `key` set by the loop further down
    return player[key]

def f4():
    return player_country[key]

def f5():
    return 'Unknown'

def f6():
    return 0

# .assign() dependent on a boolean mask
for key, value in player.items():
    df[df['Name'] == key] = df[df['Name'] == key].assign(
            Surname = f3(), Country = f4(), Points = f1())

df[df['Name']=='Dennis'] = df[df['Name'] == 'Dennis'].assign(
        Surname = f5(), Country = f5(), Points = f6())
df = df.sort_values('Points', ascending=False)
print(df)

     A      B       Name   Surname      Country  Points
1  97.0  100.0      Roger   Federer  Switzerland  9700.0
4  93.0  100.0    Dominic     Thiem      Austria  9300.0
8  92.0  100.0       Stan  Wawrinka  Switzerland  9200.0
5  86.0  100.0      Marin     Cilic      Croatia  8600.0
6  67.0  100.0      David    Goffin      Belgium  6700.0
7  61.0  100.0       Jack      Sock          USA  6100.0
0  35.0  100.0     Rafael     Nadal        Spain  3500.0
2  34.0  100.0     Grigor  Dimitrov     Bulgaria  3400.0
3  25.0  100.0  Alexander    Zverev      Germany  2500.0
9  48.0  100.0     Dennis   Unknown      Unknown     0.0
answered Oct 19 '22 by KRKirov


The answers provided so far do not provide a speedup for my specific case, for reasons I provided in the comments. The best solution I've been able to find so far is primarily based on this answer to another question. It didn't provide me a large speedup (about 10%), but it's the best I've been able to do so far. I'd still be very much interested in faster solutions if they exist!

It turns out that, like the assign function, apply can in fact also be given a function that returns a Series of values for multiple columns at once, instead of only one that returns a single scalar. So, the fastest implementation I have so far looks as follows:

# first initialize all the new columns with standard values for entire df at once
# this turns out to be very important. Skipping this comes at a high computational cost
for new_column in ["NewColumn1", "NewColumn2", "etc."]:
    df[new_column] = np.nan

df = df.apply(compute_all_new_columns, axis=1)

And then, instead of having all those separate lambda's for all the different new columns, they're all implemented in the same function like this:

def compute_all_new_columns(row):
    if row["SomeExistingColumn"] in some_dictionary:
        row["NewColumn1"] = some_dictionary[row["SomeExistingColumn"]]
    else:
        row["NewColumn1"] = some_default_value

    if some_other_condition:
        row["NewColumn2"] = whatever
    else:
        row["NewColumn2"] = row["SomeExistingColumn"] * whatever

    # assign values to other new columns here

    # the modified row must be returned so that apply() uses it
    return row

The resulting dataframe contains all the columns it previously did, plus values for all the new columns as inserted on a row-by-row basis by the compute_all_new_columns function. The original ordering is preserved. This solution contains no Python-based loops (which are slow), and only a single loop through the rows "behind the scenes", as provided by the pandas apply function.
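A self-contained toy version of this pattern, with an invented dictionary and invented column contents just to show it running end to end:

```python
import numpy as np
import pandas as pd

some_dictionary = {"x": 100, "y": 200}
some_default_value = -1

df = pd.DataFrame({"SomeExistingColumn": ["x", "z", "y"]})

# pre-initialize the new columns for the whole frame at once
for new_column in ["NewColumn1", "NewColumn2"]:
    df[new_column] = np.nan

def compute_all_new_columns(row):
    # dictionary lookup with a default, as in the question
    row["NewColumn1"] = some_dictionary.get(
        row["SomeExistingColumn"], some_default_value)
    # an arbitrary second computation on the same row
    row["NewColumn2"] = len(row["SomeExistingColumn"]) * 2
    return row  # apply(axis=1) replaces each row with the returned Series

df = df.apply(compute_all_new_columns, axis=1)
print(df[["NewColumn1", "NewColumn2"]])
```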

answered Oct 19 '22 by Dennis Soemers