I know two ways of adding a new column to a pandas DataFrame:
df_new = df.assign(new_column=default_value)
and
df['new_column'] = default_value
The first one does not add the column in place, but the second one does. So, which one is more efficient to use?
Apart from these two, is there an even more efficient method?
In pandas, you can add a new column to an existing DataFrame using the DataFrame.insert() method, which updates the existing DataFrame in place. DataFrame.assign() also adds a new column, but it returns a new DataFrame instead of modifying the original.
Technique 1: insert() Method
To add a new column to an existing DataFrame in place, use the insert() method.
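As a rough sketch of the difference described above (the data and column names below are made up for illustration), insert() changes the DataFrame in place at a chosen position, while assign() leaves the original alone and returns a new object:

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only.
df = pd.DataFrame({"A": [1, 2, 3]})

# insert(loc, column, value) adds the column in place at position `loc`
# and returns None.
df.insert(1, "new_column", 10)

# assign() returns a new DataFrame; `df` itself is not changed.
df_new = df.assign(other_column=0)

print(list(df.columns))      # ['A', 'new_column']
print(list(df_new.columns))  # ['A', 'new_column', 'other_column']
```

Note that insert() raises a ValueError if the column already exists, unless allow_duplicates=True is passed.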
I think the second one is faster; assign is for when you want clean code that chains all the calls in one line:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(10000)})
default_value = 10
In [114]: %timeit df_new = df.assign(new_column=default_value)
228 µs ± 4.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [115]: %timeit df['new_column'] = default_value
86.1 µs ± 654 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
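For context, the chaining style that assign makes possible might look like the sketch below. The derived column B and the sort step are assumptions added purely for illustration; they are not part of the timings above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.rand(5)})

# Each call returns a new DataFrame, so the steps can be chained
# without intermediate variables; `df` is left unmodified.
result = (
    df.assign(new_column=10)
      .assign(B=lambda d: d["A"] * d["new_column"])  # uses the column added above
      .sort_values("B")
)
```

The price of this convenience is that every assign call builds a new DataFrame, which is consistent with it being slower than direct assignment in the timings.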
I use perfplot for plotting:
import numpy as np
import pandas as pd
import perfplot

default_value = 10

def chained(df):
    # assign returns a new DataFrame with the extra column
    df = df.assign(new_column=default_value)
    return df

def no_chained(df):
    # direct assignment adds the column to df in place
    df['new_column'] = default_value
    return df

def make_df(n):
    df = pd.DataFrame({'A': np.random.rand(n)})
    return df

perfplot.show(
    setup=make_df,
    kernels=[chained, no_chained],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')