avoid repeating the dataframe name when operating on pandas columns

Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?

In R, data.table lets you operate on a column without repeating the table name, like this:

very_long_dt_name = data.table::data.table(col1=c(1,2,3),col2=c(3,3,1))

# operate on the columns without repeating the dt name:

very_long_dt_name[,ratio:=round(col1/col2,2)]

I couldn't figure out how to do it with pandas in Python so I keep repeating the df name:

import numpy as np
import pandas as pd

data = {'col1': [1,2,3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)

# operating on the columns requires repeating the df name

very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)

I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.

asked May 28 '19 10:05 by Julien Massardier


1 Answer

Try assign:

very_long_df_name.assign(ratio=lambda x: np.round(x.col1/x.col2,2))

Output:

    col1    col2    ratio
0   1       3       0.33
1   2       3       0.67
2   3       1       3.00
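
Note that assign returns a new DataFrame and leaves the original untouched, so you need to bind the result to a name if you want to keep the column. A minimal sketch using the question's data (the name `result` is only for illustration):

```python
import numpy as np
import pandas as pd

very_long_df_name = pd.DataFrame({"col1": [1, 2, 3], "col2": [3, 3, 1]})

# assign builds and returns a new DataFrame; inside the lambda, x is the frame itself,
# so the long name does not have to be repeated.
result = very_long_df_name.assign(ratio=lambda x: np.round(x["col1"] / x["col2"], 2))

# The original frame has no 'ratio' column yet.
print("ratio" in very_long_df_name.columns)  # False

# Rebind the name to keep the new column.
very_long_df_name = result
print(very_long_df_name["ratio"].tolist())  # [0.33, 0.67, 3.0]
```

Bracket access (x["col1"]) behaves the same as x.col1 here; it is just safer when a column name clashes with a DataFrame attribute or is not a valid identifier.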

Edit, in response to the comments: timings on 1 million rows.

%%timeit
very_long_df_name.assign(ratio = lambda x:x.col1/x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and

%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1']/very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And with np.round, using assign:

%%timeit
very_long_df_name.assign(ratio = lambda x: np.round(x.col1/x.col2,2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

and with direct assignment:

%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So it appears that assign is vectorized too; it is just somewhat slower than direct assignment (roughly 5 to 9 ms in these runs).
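
If you want to reproduce the comparison outside IPython, here is a rough sketch using the standard timeit module. The random data, the seed, and the repeat count are my own assumptions, not the setup used for the numbers above, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 1 million rows of random floats (col2 kept away from zero).
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({"col1": rng.random(n), "col2": rng.random(n) + 1.0})

# Separate copy for the in-place variant, so the two runs do not interfere.
target = df.copy()


def with_assign():
    # Returns a new DataFrame on every call.
    return df.assign(ratio=lambda x: x["col1"] / x["col2"])


def with_direct():
    # Writes the column into the existing frame.
    target["ratio"] = target["col1"] / target["col2"]


runs = 20
t_assign = timeit.timeit(with_assign, number=runs) / runs
t_direct = timeit.timeit(with_direct, number=runs) / runs

print(f"assign: {t_assign * 1e3:.1f} ms per call")
print(f"direct: {t_direct * 1e3:.1f} ms per call")
```

Both variants compute the same column; the only difference measured is the overhead of building a new frame versus writing into the existing one.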

answered Oct 25 '22 19:10 by Quang Hoang