
Vectorized "and" for pandas columns

Given a DataFrame like this one:

import pandas as pd
tcd = pd.DataFrame({
 'a': {'p_1': 1, 'p_2': 1, 'p_3': 0, 'p_4': 0}, 
 'b': {'p_1': 0, 'p_2': 1, 'p_3': 1, 'p_4': 1}, 
 'c': {'p_1': 0, 'p_2': 0, 'p_3': 1, 'p_4': 0}})
tcd
#      a  b  c
# p_1  1  0  0
# p_2  1  1  0
# p_3  0  1  1
# p_4  0  1  0

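(but with 40,000 columns)

I am looking for a vectorized way to compute, for each pair of columns, whether their element-wise boolean AND is true in at least one row, and to store the results in a Series: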

a & b = ab -> 1 or True    a & c = ac -> 0 or False
1   0   0                  1   0   0
1   1   0                  1   0   0
0   1   1                  0   1   0
0   1   0                  0   0   0

For now I only have an ugly solution with a for loop:

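res = pd.Series(False, index=['a&a', 'a&b', 'a&c'])
for i in range(3):
    # True if column 0 and column i are both 1 in at least one row
    res.iloc[i] = (tcd.iloc[:, 0] & tcd.iloc[:, i]).any()

res
# a&a     True
# a&b     True
# a&c    False
# dtype: bool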

With B. M.'s answer, I get this:

def get_shared_p(tcd, i):
    # Broadcast column i (as a 2-D column vector) against every column of tcd
    res = (tcd & tcd.iloc[:, i].values[:, None]).any()
    res.index += '&_{}'.format(i)
    return res

cols = tcd.shape[1]  # number of columns
res = pd.DataFrame(columns=range(cols), index=range(cols))
for col_i in range(cols):
    res.iloc[:, col_i] = list(get_shared_p(tcd, col_i))

print(res)
#        0     1      2
# 0   True  True  False
# 1   True  True   True
# 2  False  True   True

Is there a way to avoid this new for loop as well?

user3313834 asked Jan 25 '16 20:01




2 Answers

You can use np.logical_and and numpy's broadcasting.

Say you define x and y as the entire matrix and the first column, respectively:

import numpy as np

x = tcd.values  # as_matrix() was removed in pandas 1.0
y = tcd.a.values.reshape((len(tcd), 1))

Now, using broadcasting, find the logical AND of x and y, and place it in and_:

and_ = np.logical_and(x, y)

Finally, check for each column whether any of its rows is true (summing along axis 0, the row axis):

>>> np.sum(and_, axis=0) > 0
array([ True,  True, False])
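
The same broadcasting idea can be pushed one dimension further to cover every pair of columns at once, which would remove the outer loop from the question. The following is only a sketch, assuming the data holds nothing but 0/1 values; the names `pairwise`, `shared` and `shared_df` are made up for illustration. The intermediate array has shape (rows, cols, cols), so this is likely only practical for a moderate number of columns, not for 40,000.

```python
import numpy as np
import pandas as pd

tcd = pd.DataFrame({
    'a': {'p_1': 1, 'p_2': 1, 'p_3': 0, 'p_4': 0},
    'b': {'p_1': 0, 'p_2': 1, 'p_3': 1, 'p_4': 1},
    'c': {'p_1': 0, 'p_2': 0, 'p_3': 1, 'p_4': 0}})

x = tcd.to_numpy().astype(bool)            # shape (rows, cols)

# x[:, :, None] has shape (rows, cols, 1); x[:, None, :] has shape (rows, 1, cols).
# Broadcasting ANDs every column with every other column, row by row.
pairwise = x[:, :, None] & x[:, None, :]   # shape (rows, cols, cols)

# Collapse the row axis: entry [i, j] is True if columns i and j
# are both true in at least one row.
shared = pairwise.any(axis=0)              # shape (cols, cols)

# Re-attach the column labels (illustrative name, not from the original post).
shared_df = pd.DataFrame(shared, index=tcd.columns, columns=tcd.columns)
print(shared_df)
```

The first row of `shared_df` corresponds to the single-column result above (column a against a, b and c).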

Ami Tavory answered Sep 26 '22 05:09


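Use [:,None] to turn the column into a 2-D column vector and force broadcasting. On recent pandas versions a Series no longer accepts [:,None] directly, so index the underlying NumPy array through .values:

In[1] : res=(tcd & tcd.a.values[:,None]).any(); res.index+='&a'; res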

Out[1]:
a&a     True
b&a     True
c&a    False
dtype: bool
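
For the all-pairs table asked about in the question, one possible alternative to looping over columns is a matrix product. This is a sketch, not part of the answer above, and it assumes every value is 0 or 1: the product of the transposed frame with itself then counts, for each pair of columns, the rows where both are 1. The names `counts` and `shared` are illustrative.

```python
import pandas as pd

tcd = pd.DataFrame({
    'a': {'p_1': 1, 'p_2': 1, 'p_3': 0, 'p_4': 0},
    'b': {'p_1': 0, 'p_2': 1, 'p_3': 1, 'p_4': 1},
    'c': {'p_1': 0, 'p_2': 0, 'p_3': 1, 'p_4': 0}})

# counts.loc[i, j] is the number of rows where columns i and j are both 1
# (valid only under the assumption that tcd contains just 0s and 1s).
counts = tcd.T.dot(tcd)

# A pair "shares" a row as soon as that count is positive.
shared = counts > 0
print(shared)
```

The result keeps the column labels on both axes. With 40,000 columns the output alone has 1.6 billion entries, so processing the columns in chunks or using a sparse representation may be necessary.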

B. M. answered Sep 22 '22 05:09