Suppose that you have a pandas DataFrame
which has some kind of data in the body and numbers in the column
and index
names.
>>> data=np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
2 4 8
10 a b c
4 d e f
2 g h i
Now suppose we want to manipulate are data frame in some kind of way based on comparing the index and columns. Consider the following.
Where index is greater than column replace letter with 'k':
2 4 8
10 k k k
4 k e f
2 g h i
Where index is equal to column replace letter with 'U':
2 4 8
10 k k k
4 k U f
2 U h i
Where column is greater than index replace letter with 'Y':
2 4 8
10 k k k
4 k U Y
2 U Y Y
To keep the question useful to all:
What is a fast way to do this replacement?
What is the simplest way to do this replacement?
Speed Results from minimal example
jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is this a duplicate?
I searched google for pandas replace compare index column
and the top results are:
Pandas - Compare two dataframes and replace values matching condition
Python pandas: replace values based on location not index value
Pandas DataFrame: replace all values in a column, based on condition
However, I don't feel any of these touch on whether this a) possible or b) how to compare in such a way
You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.
You can easily replace a value in pandas data frames by just specifying its column and its index. Having the dataframe above, we will replace some of its values. We are using the loc function of pandas. The first variable is the index of the value we want to replace and the second is its column.
I think you need numpy.select
with broadcasting:
m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values
df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print (df)
2 4 8
10 k k k
4 k U Y
2 U Y Y
Performance:
np.random.seed(1000)
N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)
df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)
def us(df):
values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
greater = values > 0
less = values < 0
same = values == 0
values[greater] = 'k'
values[less] = 'Y'
values[same] = 'U'
return pd.DataFrame(values, columns=df.columns, index=df.index)
def jez(df):
m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values
return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With