Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame change a value based on column, index values comparison

Suppose that you have a pandas DataFrame which has some kind of data in the body and numbers in the column and index names.

>>> data=np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
    2  4  8
10  a  b  c
4   d  e  f
2   g  h  i

Now suppose we want to manipulate are data frame in some kind of way based on comparing the index and columns. Consider the following.

Where index is greater than column replace letter with 'k':

    2  4  8
10  k  k  k
4   k  e  f
2   g  h  i

Where index is equal to column replace letter with 'U':

    2  4  8
10  k  k  k
4   k  U  f
2   U  h  i

Where column is greater than index replace letter with 'Y':

    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

To keep the question useful to all:

  • What is a fast way to do this replacement?

  • What is the simplest way to do this replacement?

Speed Results from minimal example

  • jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Is this a duplicate? I searched google for pandas replace compare index column and the top results are:

Pandas - Compare two dataframes and replace values matching condition

Python pandas: replace values based on location not index value

Pandas DataFrame: replace all values in a column, based on condition

However, I don't feel any of these touch on whether this a) possible or b) how to compare in such a way

like image 528
akozi Avatar asked Nov 02 '18 12:11

akozi


People also ask

How do I change values in a column in pandas based on condition?

You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.

How do you replace a value in a DataFrame using an index?

You can easily replace a value in pandas data frames by just specifying its column and its index. Having the dataframe above, we will replace some of its values. We are using the loc function of pandas. The first variable is the index of the value we want to replace and the second is its column.


1 Answers

I think you need numpy.select with broadcasting:

m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values


df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print (df)
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

Performance:

np.random.seed(1000)

N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)

df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)

def us(df):
    values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
    greater = values > 0
    less = values < 0
    same = values == 0

    values[greater] = 'k'
    values[less] = 'Y'
    values[same] = 'U'


    return pd.DataFrame(values, columns=df.columns, index=df.index)

def jez(df):

    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)

In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 111
jezrael Avatar answered Oct 06 '22 18:10

jezrael