I have a large DataFrame df (~100 columns and ~7 million rows), and I need to create ~50 new variables/columns which are simple transformations of the current variables. One way to proceed would be with many .apply statements (I'm just using transform* as a placeholder for simple transformations such as max or squaring):
df['new_var1'] = df['old_var1'].apply(lambda x: transform1(x))
...
df['new_var50'] = df['old_var50'].apply(lambda x: transform50(x))
Another way would be to first create a dictionary:

transform_dict = {
    'new_var1': lambda row: transform1(row),
    ...,
    'new_var50': lambda row: transform50(row)
}

and then write one .apply combined with .concat:
df = pd.concat([df,
                df.apply(lambda r: pd.Series({var: transform_dict[var](r)
                                              for var in transform_dict}),
                         axis=1)],
               axis=1)
Is one method preferred over the other, whether in how 'Pythonic' it is, or in efficiency, scalability, or flexibility?
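As a side note, if each transformN is vectorized (i.e. it accepts a whole Series rather than a scalar), the same dictionary idea can also be written with .assign, whose callable values receive the entire DataFrame; a sketch of that variant:

df = df.assign(**{'new_var1': lambda d: transform1(d['old_var1']),
                  # ...,
                  'new_var50': lambda d: transform50(d['old_var50'])})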
Starting with:
df = pd.DataFrame(np.random.random((1000, 100)))
Adding individual columns:
def cols_via_apply(df):
    for i in range(100, 150):
        df[i] = df[i - 100].apply(lambda x: x * i)
    return df
%timeit cols_via_apply(df)
10 loops, best of 3: 29.6 ms per loop

df.info() then shows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 150 entries, 0 to 149
dtypes: float64(150)
memory usage: 1.2 MB
None
This seems quite a bit more efficient than using pd.concat, presumably because that approach involves a Python-level loop over the rows of the DataFrame, so the difference will get worse as the DataFrame gets longer (a fully vectorized alternative is sketched after the timings below):
def cols_via_concat(df):
    # the outer concat also needs axis=1 (add columns, not rows)
    df = pd.concat([df, df.apply(lambda row: pd.Series(
        {i: i * row[i - 100] for i in range(100, 150)}), axis=1)], axis=1)
    return df
%timeit cols_via_concat(df)
1 loops, best of 3: 450 ms per loop

df.info() again shows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Columns: 150 entries, 0 to 149
dtypes: float64(150)
memory usage: 1.2 MB
None
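For comparison, a fully vectorized variant (my own sketch with the hypothetical name cols_via_vectorized; it was not part of the timings above): since the transformation here is plain arithmetic, .apply can be dropped entirely, avoiding one Python function call per element:

def cols_via_vectorized(df):
    for i in range(100, 150):
        # one whole-column multiply per new column, no per-element lambda
        df[i] = df[i - 100] * i
    return df

This kind of column-wise assignment is generally the fastest of the three, since all the arithmetic happens inside NumPy.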
Continuing the experiment of @Stefan, but with a size of 100k x 100, and adding a new method that first allocates a block of NaNs, concatenates it to the DataFrame, and then uses .iloc to perform the calculation on each column (a one-shot vectorized variant is sketched after the timings below).
def cols_via_iloc(df):
    # pre-allocate 50 NaN columns, then fill each with one vectorized multiply
    df = pd.concat([df, pd.DataFrame(np.tile(np.nan, [len(df), 50]))], axis=1)
    for i in range(100, 150):
        df.iloc[:, i] = i * df.iloc[:, i - 100]
    return df
def cols_via_apply(df):
    for i in range(100, 150):
        df[i] = df[i - 100].apply(lambda x: x * i)
    return df
def cols_via_concat(df):
    # as above, the outer concat needs axis=1 to add columns rather than rows
    df = pd.concat([df, df.apply(lambda row: pd.Series(
        {i: i * row[i - 100] for i in range(100, 150)}), axis=1)], axis=1)
    return df
>>> %%timeit df = pd.DataFrame(np.random.randn(100000, 100))
cols_via_iloc(df)
1 loops, best of 3: 540 ms per loop
>>> %%timeit df = pd.DataFrame(np.random.randn(100000, 100))
cols_via_apply(df)
1 loops, best of 3: 2.91 s per loop
>>> %%timeit df = pd.DataFrame(np.random.randn(100000, 100))
cols_via_concat(df)
1 loops, best of 3: 55.8 s per loop
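Going one step further (again a sketch of mine, not included in the timings above): because each new column here is just an existing column times a scalar, all 50 columns can be built with a single broadcast multiply and attached with one concat:

def cols_via_block(df):
    # columns 0..49 times the scalars 100..149, broadcast column-wise
    new = df.iloc[:, :50] * np.arange(100, 150)
    new.columns = range(100, 150)  # label the block as columns 100..149
    return pd.concat([df, new], axis=1)

This removes the Python-level loop over columns as well, although with only 50 columns the per-column loop in cols_via_iloc is already close to optimal.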