Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?
In R, data.table lets you operate on a column without repeating the table name, like this:
very_long_dt_name = data.table::data.table(col1=c(1,2,3),col2=c(3,3,1))
# operate on the columns without repeating the dt name:
very_long_dt_name[,ratio:=round(col1/col2,2)]
I couldn't figure out how to do this with pandas in Python, so I keep repeating the df name:
import numpy as np
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)
# operating on the columns requires repeating the df name
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.
Try assign:
very_long_df_name.assign(ratio=lambda x: np.round(x.col1 / x.col2, 2))
Output:
col1 col2 ratio
0 1 3 0.33
1 2 3 0.67
2 3 1 3.00
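One thing worth noting, which the answer above doesn't show: assign returns a new DataFrame rather than modifying the original in place, so to keep the new column you need to bind the result back to a name. A minimal sketch:
# assign returns a copy; rebind it to keep the new 'ratio' column
very_long_df_name = very_long_df_name.assign(
    ratio=lambda x: np.round(x.col1 / x.col2, 2)
)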
Edit: to reflect the comments, here are timings on 1 million rows:
%%timeit
very_long_df_name.assign(ratio = lambda x:x.col1/x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and
%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1']/very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And with np.round, using assign:
%%timeit
very_long_df_name.assign(ratio = lambda x: np.round(x.col1/x.col2,2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
and without assign:
%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1']/very_long_df_name['col2'],2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it appears that assign is also vectorized, just with a bit more overhead.
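As an aside, not part of the original answer: pandas also has DataFrame.eval, which lets you refer to columns by bare name inside a string expression, much like data.table. A rough sketch, with the rounding done as a separate step since I'm not certain round is supported inside the expression:
# eval parses the expression against the frame's columns, so the name appears only once
very_long_df_name = very_long_df_name.eval("ratio = col1 / col2")
very_long_df_name['ratio'] = very_long_df_name['ratio'].round(2)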