Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas startswith operation between two columns

I have a pandas dataframe with two columns, where I need to check where the value at each row of column A is a string that starts with the value of the corresponding row at column B or viceversa.

It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as any of the two columns.

I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:

df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])

while my working solution is the following:

pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])

I suspect that there may be a better way to tackle this issue as it also would benefit all string methods on series.

Is there any more beautiful or efficient method to do this?

like image 456
linello Avatar asked Sep 18 '25 02:09

linello


1 Answers

One idea is use np.vecorize, but because working with strings performance is only a bit better like your solution:

def fun (a,b):
    return a.startswith(b) or b.startswith(a)

f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0     True
1    False
dtype: bool

df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)

In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#sammywemmy solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 91
jezrael Avatar answered Sep 19 '25 17:09

jezrael