With data like this:
import pandas as pd
tcd = pd.DataFrame({
'a': {'p_1': 1, 'p_2': 1, 'p_3': 0, 'p_4': 0},
'b': {'p_1': 0, 'p_2': 1, 'p_3': 1, 'p_4': 1},
'c': {'p_1': 0, 'p_2': 0, 'p_3': 1, 'p_4': 0}})
tcd
# a b c
# p_1 1 0 0
# p_2 1 1 0
# p_3 0 1 1
# p_4 0 1 0
(but with about 40,000 columns)
I am looking for a vectorized way to compute the boolean AND of pairs of columns and store, in a result Series, whether any row is true for each pair:
a & b = ab -> 1 or True        a & c = ac -> 0 or False
1   0   0                      1   0   0
1   1   1                      1   0   0
0   1   0                      0   1   0
0   1   0                      0   0   0
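For now I only have an ugly solution with a for loop:
res = pd.Series(False, index=['a&a', 'a&b', 'a&c'])
for i in range(3):
    # True if column a and column i are both 1 in at least one row
    res.iloc[i] = (tcd.iloc[:, 0] & tcd.iloc[:, i]).any()
res
# a&a     True
# a&b     True
# a&c    False
# dtype: bool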
Edit: with B.M.'s answer I get this:
def get_shared_p(tcd, i):
    # AND column i (as an n x 1 array) with every column, then reduce over rows
    res = (tcd.iloc[:, i].values[:, None] & tcd).any()
    res.index += '&_{}'.format(i)
    return res

cols = tcd.shape[1]
res = pd.DataFrame(columns=range(cols), index=range(cols))
for col_i in range(cols):
    res.iloc[:, col_i] = list(get_shared_p(tcd, col_i))
print(res)
# 0 1 2
# 0 True True False
# 1 True True True
# 2 False True True
We can probably avoid this new for loop.
You can use np.logical_and and numpy's broadcasting.
Say you define x and y as the entire matrix and the first column, respectively:
import numpy as np
x = tcd.to_numpy()  # as_matrix() was removed in pandas 1.0
y = tcd.a.values.reshape((len(tcd), 1))
Now, using broadcasting, find the logical and of x and y, and place it in and_:
and_ = np.logical_and(x, y)
Finally, check for each column whether any row is true:
>>> and_.any(axis=0)
array([ True,  True, False])
Use [:, None] on the underlying array to align the data and force broadcasting:
In[1] : res=(tcd.a.values[:,None] & tcd).any(); res.index+='&a'; res
Out[1]:
a&a True
b&a True
c&a False
dtype: bool
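The remaining loop over columns can probably be avoided as well. The following is only a sketch, not part of either answer above: it swaps the broadcasting trick for a matrix product, and the helper name shared_any is hypothetical. It assumes every non-zero entry counts as True. Entry (i, j) of m.T @ m counts the rows where columns i and j are both set, so comparing it with zero gives the "any" result for every pair at once.

```python
import numpy as np
import pandas as pd

tcd = pd.DataFrame({
    'a': {'p_1': 1, 'p_2': 1, 'p_3': 0, 'p_4': 0},
    'b': {'p_1': 0, 'p_2': 1, 'p_3': 1, 'p_4': 1},
    'c': {'p_1': 0, 'p_2': 0, 'p_3': 1, 'p_4': 0}})

def shared_any(df):
    """Hypothetical helper: entry (i, j) is True when columns i and j
    are both non-zero in at least one row."""
    # rows x cols matrix of 0/1 (assumption: non-zero means True)
    m = (df.to_numpy() != 0).astype(np.int64)
    # counts[i, j] = number of rows where columns i and j are both set
    counts = m.T @ m
    return pd.DataFrame(counts > 0, index=df.columns, columns=df.columns)

shared = shared_any(tcd)
print(shared)
```

On the sample data this reproduces the 3 x 3 table from the question. Note that the result has one entry per pair of columns: with 40,000 columns that is 1.6e9 entries, so the intermediate int64 counts alone would need about 12.8 GB. On real data it may be necessary to process the columns in blocks or to use a sparse matrix.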