
Comparing two columns of pandas dataframe by np.char.find gives TypeError: string operation on non-string array

I wish to compare two series of strings to find if one contains the other element-wise.

I first tried with apply, but it is slow:

import numpy as np
import pandas as pd

cols = ['s1', 's2']
list_of_series = [pd.Series(['one', 'sdf'], index=cols), pd.Series(['two', 'x y two'], index=cols)]
df = pd.DataFrame(list_of_series, columns=cols)
df
    s1       s2
0  one      sdf
1  two  x y two

df.apply(lambda row: row['s1'] in row['s2'], axis=1)
0    False
1     True
dtype: bool

np.char.find works when I pass plain NumPy string arrays:

x=np.array(['one','two'])
y=np.array(['sdf','x y two'])

np.char.find(y,x)
array([-1,  4])

but when I pass the columns of the dataframe, I get an error:

np.char.find(df.s2.values,df.s1.values)
TypeError: string operation on non-string array

Can someone advise a solution?

Timothée HENRY asked Oct 26 '18 15:10


1 Answer

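Use find from numpy.core.defchararray (np.char is the public alias for this module) and cast both arrays with astype(str). The column values come back as object-dtype arrays, and the string routines only accept real string arrays, which is what raises the TypeError: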

from numpy.core.defchararray import find
find(df.s2.values.astype(str),df.s1.values.astype(str))!=-1
Out[430]: array([False,  True])
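
For completeness, here is a minimal self-contained sketch of the same idea, written against the public np.char.find name rather than the numpy.core import path (recent NumPy releases warn when numpy.core is imported directly). The DataFrame is rebuilt from a dict for brevity, and the result column name s1_in_s2 is only an illustrative choice, not something from the question.

```python
import numpy as np
import pandas as pd

# Same data as in the question, built from a dict for brevity.
df = pd.DataFrame({"s1": ["one", "two"], "s2": ["sdf", "x y two"]})

# The column values are not a NumPy string array (typically object dtype),
# so cast them to a fixed-width unicode array before using np.char.
s1 = df["s1"].to_numpy().astype(str)
s2 = df["s2"].to_numpy().astype(str)

# Element-wise str.find: index of s1[i] inside s2[i], or -1 if absent.
positions = np.char.find(s2, s1)

# Turn the positions into a boolean "contains" mask.
contains = positions != -1

# Hypothetical column name, used here only to show the result in place.
df["s1_in_s2"] = contains
print(df)
```

One caveat, assuming the columns may contain missing values: astype(str) turns NaN into the literal string 'nan', so it may be worth filling or dropping missing values first.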
BENY answered Oct 19 '22 03:10