
Why use pandas.assign rather than simply initialize new column?

Tags: python, pandas

I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always managed by simply initializing a new column 'on the fly'. Is there a reason why assign is better?

For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:

import numpy as np
from pandas import DataFrame
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])

but the pandas.DataFrame.assign documentation recommends doing this:

df.assign(ln_A=lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)

Both methods produce a dataframe with the same contents. In fact, the first method (my 'on the fly' assignment) is significantly faster (0.202 seconds for 1000 iterations) than the .assign method (0.353 seconds for 1000 iterations).

So is there a reason I should stop using my old method in favour of df.assign?

sacuL asked Jan 09 '18 23:01



1 Answer

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.

In particular, DataFrame.assign returns a new object that has a copy of the original data with the requested changes; the original frame remains unchanged.

In your particular case:

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) 

Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:

>>> new_df = df.assign(A=1) 

If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while assignment with [...] does not.
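As a minimal sketch of the difference, the snippet below rebuilds the frame from the question and checks what each approach does to df. The helper names (assign_left_df_alone, new_df_is_all_ones, df_overwritten) are made up for illustration and are not part of pandas or of the original post.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# .assign returns a new frame; df itself keeps its original A column.
new_df = df.assign(A=1)
assign_left_df_alone = df["A"].tolist() == list(range(1, 11))
new_df_is_all_ones = bool((new_df["A"] == 1).all())

# Item assignment modifies df in place, so the original values are lost.
df["A"] = 1
df_overwritten = bool((df["A"] == 1).all())

print(assign_left_df_alone, new_df_is_all_ones, df_overwritten)
# prints: True True True
```

If you need both the original and the modified frame, the .assign form gives you both in one step. If you only need the modified one, item assignment avoids creating a second frame.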

donkopotamus answered Sep 18 '22 15:09