I have a pandas DataFrame in which each row contains a name followed by many numeric columns. After a specific column index, calculated separately for each row, I want to set all the remaining values in that row to 0.
I tried a few things and have the following working code:
for i in range(n):
    index = np.where(df.columns == df['match_this_value'][i])[0].item()
    df.iloc[i, index] = df['take_this_value'][i].day
    df.iloc[i, (index+1):] = 0
However, this takes quite long because my dataset is very large. The runtime is about 70 seconds for my sample dataset, and my entire dataset is much larger. Is there a faster way to do this? Furthermore, is there a better way to do this manipulation without looping through each row?
EDIT: Sorry, I should have specified how the index is calculated. The index is calculated through an np.where that compares all of the column names of the dataframe against the value in one specific column (for each row) and finds the match. So something like:
index = np.where(df.columns == df['match_this_value'][i])[0].item()
Once I have this index, I set the value at that column to the value of another column in the df. The entire code right now looks like this:
for i in range(n):
    index = np.where(df.columns == df['match_this_value'][i])[0].item()
    df.iloc[i, index] = df['take_this_value'][i].day
    df.iloc[i, (index+1):] = 0
You could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 4), columns=list('ABCD'))
# A B C D
# 0 0.750017 0.582230 1.411253 -0.379428
# 1 -0.747129 1.800677 -1.243459 -0.098760
# 2 -0.742997 -0.035036 1.012052 -0.767602
# 3 -0.694679 1.013968 -1.000412 0.752191
indexes = np.random.choice(range(df.shape[1]), df.shape[0])
# array([0, 3, 1, 1])
df_indexes = np.tile(range(df.shape[1]), (df.shape[0], 1))
df[df_indexes>indexes[:, None]] = 0
print(df)
# A B C D
# 0 0.750017 0.000000 0.000000 0.00000
# 1 -0.747129 1.800677 -1.243459 -0.09876
# 2 -0.742997 -0.035036 0.000000 0.00000
# 3 -0.694679 1.013968 0.000000 0.00000
So here you use a boolean mask, df_indexes > indexes[:, None], and indexes would be replaced with your "specific indexes".
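To connect this back to the question's loop, here is a minimal sketch of how the same mask idea might be applied when the per-row position comes from matching 'match_this_value' against the column names. The sample data, the column names d1..d4, and the value_cols list are all made up for illustration, and the sketch assumes that 'take_this_value' holds datetimes and that the columns to be zeroed are exactly the ones listed in value_cols. It has not been timed on the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names d1..d4 are invented for illustration.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "match_this_value": ["d2", "d4", "d1"],
    "take_this_value": pd.to_datetime(["2020-01-15", "2020-02-20", "2020-03-05"]),
    "d1": [1.0, 2.0, 3.0],
    "d2": [4.0, 5.0, 6.0],
    "d3": [7.0, 8.0, 9.0],
    "d4": [10.0, 11.0, 12.0],
})
value_cols = ["d1", "d2", "d3", "d4"]  # assumed: the columns the loop operates on

# Work on a writable NumPy copy of the numeric block.
values = df[value_cols].to_numpy(dtype=float, copy=True)

# Per-row position of the matching column; get_indexer returns -1 for no match.
pos = pd.Index(value_cols).get_indexer(df["match_this_value"].to_numpy())
assert (pos >= 0).all(), "every row must name an existing column"

# Zero everything to the right of each row's position (same mask idea as above).
values[np.arange(values.shape[1]) > pos[:, None]] = 0

# Write the day of 'take_this_value' at the matched position itself.
values[np.arange(len(df)), pos] = df["take_this_value"].dt.day.to_numpy()

result = df.copy()
result[value_cols] = values
print(result)
```

The two array assignments replace the per-row iloc writes, so no Python-level loop over rows is needed. If some rows might not match any column, the -1 entries from get_indexer would need to be handled before the assignments.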