I have a dataframe like this:
A B
exa 3
example 6
exam 4
hello 4
hell 3
I want to delete the rows whose A value is a substring of another row's A value, keeping only the longest one. (Note that B is already the length of A.)
I want my table to look like this:
A B
example 6
hello 4
I thought about the following boolean filter, but it does not work :(
df['Check'] = df.apply(lambda row: df.count(row['A'] in row['A'])>1, axis=1)
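This is non-trivial, but we can take advantage of B: sort the data by length, then compare each value only with the strings that come after it in that order (those at least as long). This is still quadratic in the worst case, but it does roughly half the comparisons of a naive O(N^2) check. Note that startswith only detects prefixes, which is enough for the sample data; use `a in b` instead if a substring can appear anywhere in the longer string.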
df = df.sort_values('B')
v = df['A'].tolist()
df[[not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]].sort_index()
A B
1 example 6
3 hello 4
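For reference, the same sort-then-compare idea can be sketched as a self-contained helper that tests for substrings anywhere in the string (`a in b`) rather than only prefixes. The name `substring_mask` is made up for this illustration, and it works on a plain list so it runs without pandas; treat it as a sketch under those assumptions, not a tuned implementation.

```python
def substring_mask(strings):
    """Return a list of booleans: True where the string is not contained in
    any other string of the list (hypothetical helper, not a pandas API)."""
    # Visit positions from shortest to longest string; sorted() is stable.
    order = sorted(range(len(strings)), key=lambda i: len(strings[i]))
    keep = [True] * len(strings)
    for rank, i in enumerate(order):
        a = strings[i]
        # Only strings at least as long as `a` can contain it.
        if any(a in strings[j] for j in order[rank + 1:]):
            keep[i] = False
    return keep


values = ["exa", "example", "exam", "hello", "hell"]
mask = substring_mask(values)
print([s for s, k in zip(values, mask) if k])  # ['example', 'hello']
```

With the dataframe from the question, `df[substring_mask(df['A'].tolist())]` should select the same two rows. If a value appears more than once, this sketch keeps the last occurrence and drops the earlier ones.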
Like the solution cold provided, mine is O(m*n) as well (in your case m = n).
import numpy as np
df[np.sum(np.array([[y in x for x in df.A.values] for y in df.A.values]),1)==1]
Out[30]:
A B
1 example 6
3 hello 4
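To unpack what the one-liner computes: for each value it counts how many values in the column contain it, and keeps the rows where that count is exactly 1 (the value itself). A minimal plain-Python sketch of that count, where `containment_counts` is a made-up name used only for illustration:

```python
def containment_counts(strings):
    # For each string y, count how many strings x in the list contain y.
    # Every string contains itself, so each count is at least 1.
    return [sum(y in x for x in strings) for y in strings]


values = ["exa", "example", "exam", "hello", "hell"]
counts = containment_counts(values)
print(counts)  # [3, 1, 2, 1, 2]
```

One consequence of the `== 1` test: if the same value appears twice in the column, both copies get a count of 2, so both rows are dropped.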