Pandas startswith operation between two columns

Question

I have a pandas dataframe with two columns, where I need to check where the value at each row of column A is a string that starts with the value of the corresponding row at column B or viceversa.

It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as any of the two columns.

I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:

df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])

while my working solution is the following:

pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])

I suspect that there may be a better way to tackle this issue as it also would benefit all string methods on series.

Is there any more beautiful or efficient method to do this?

jezrael · Accepted Answer

One idea is use np.vecorize, but because working with strings performance is only a bit better like your solution:

def fun (a,b):
    return a.startswith(b) or b.startswith(a)

f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0     True
1    False
dtype: bool

df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)

In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#sammywemmy solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas startswith operation between two columns

Tags:

python

string

pandas

vectorization

linello

1 Answers

jezrael

Recent Activity

Donate For Us

Pandas startswith operation between two columns

Tags:

python

string

pandas

vectorization

linello

1 Answers

jezrael

Related questions

Recent Activity

Donate For Us