I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by with just initializing a new column 'on the fly'. Is there a reason why assign is better?
For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
import numpy as np
from pandas import DataFrame
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
but the pandas.DataFrame.assign documentation recommends doing this:
df.assign(ln_A=lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
Both methods return the same dataframe. In fact, the first method (my 'on the fly' assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).
So is there a reason I should stop using my old method in favour of df.assign?
Pandas DataFrame: assign() function
The assign() function is used to assign new columns to a DataFrame. It returns a new object with all the original columns in addition to the new ones. Existing columns that are re-assigned will be overwritten. The new column names are given as keywords.
The premise of assign is that it returns a new DataFrame with the new columns in addition to all the existing columns. It cannot be used to change the original dataframe in place. A callable passed to assign must not change the input DataFrame (though pandas doesn't check this).
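As a minimal sketch of the behaviour described above (the frame and the names with_log and doubled are made up for illustration), the following shows a callable being passed to assign, an existing column being overwritten in the returned frame, and the original frame keeping its columns and values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# assign returns a new frame; the callable receives the frame being assigned to
with_log = df.assign(ln_A=lambda x: np.log(x["A"]))

# Re-using an existing column name overwrites it in the returned frame only
doubled = df.assign(A=df["A"] * 2)

print(list(df.columns))        # columns of the original frame
print(list(with_log.columns))  # columns of the frame returned by assign
print(doubled["A"].tolist())   # overwritten values in the returned frame
print(df["A"].tolist())        # values in the original frame
```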
A Series holds a single one-dimensional set of values with an index, whereas a DataFrame can be made of more than one Series; in other words, a DataFrame is a collection of Series that share an index.
Series.apply lets the user pass a function and apply it to every single value of a pandas Series. This makes it easy to transform or label data according to whatever conditions are required, which is why it is widely used in data science and machine learning.
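A small sketch of both points, using a made-up frame and a hypothetical odd/even labelling rule: selecting one column of a DataFrame gives a Series, and Series.apply calls the function once per value.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})

# Selecting a single column of a DataFrame gives back a Series
col = df["A"]

# apply calls the function once for each value in the Series
labels = col.apply(lambda v: "odd" if v % 2 else "even")

print(type(col).__name__)
print(labels.tolist())
```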
The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns you a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:
>>> new_df = df.assign(A=1)
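To check this for yourself, here is a self-contained version of the same example (only the imports and the print calls are added); it should show that new_df has A set to 1 everywhere while df still holds its original values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# Build a new frame with A replaced; df itself is not touched
new_df = df.assign(A=1)

print(new_df["A"].tolist())  # values of A in the new frame
print(df["A"].tolist())      # values of A in the original frame
```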
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] does not.
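If you want to measure the difference yourself, a rough sketch with timeit could look like the following. The helper names bracket_assign and method_assign are made up for this example, and the actual numbers will depend on your machine and pandas version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})
scratch = base.copy()  # separate frame that is safe to modify in place


def bracket_assign():
    # Adds (or overwrites) the column on the existing frame
    scratch["ln_A"] = np.log(scratch["A"])


def method_assign():
    # Returns a new frame and leaves base as it was
    return base.assign(ln_A=lambda x: np.log(x["A"]))


t_bracket = timeit.timeit(bracket_assign, number=1000)
t_assign = timeit.timeit(method_assign, number=1000)
print(f"[...] assignment: {t_bracket:.3f}s, .assign: {t_assign:.3f}s")
```

Both approaches compute the same ln_A values; only the first changes the frame it was called on.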