Python Pandas. Delete cells whose value is contained in another cell in the same column

Question

I have a dataframe like this:

A           B
exa         3
example     6
exam        4
hello       4
hell        3

I want to delete the rows whose value in A is a substring of another row's value and keep only the longest one. (Note that B is already the length of A.)

I want my table to look like this:

A           B
example     6
hello       4

I thought about the following boolean filter but it does not work :(

df['Check'] = df.apply(lambda row: df.count(row['A'] in row['A'])>1, axis=1)

cs95 · Accepted Answer

This is non-trivial, but we can take advantage of B to sort the data and then compare each value only with the strings longer than itself, for a solution that does fewer comparisons than the naive O(N^2) all-pairs check.

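# Sort by length so that every longer string comes after the shorter ones.
df = df.sort_values('B')
v = df['A'].tolist()

# Keep a row only if no later (longer) string starts with it, then restore the original row order.
df[[not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]].sort_index()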

         A  B
1  example  6
3    hello  4
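
Note that the filter above uses startswith, so it only catches prefixes, and it trusts B to order the strings by length. A minimal sketch of a more general variant, assuming arbitrary substrings should be dropped and that the lengths are computed from A directly. The helper name drop_contained and the temporary column _len are hypothetical, not part of the answer:

```python
import pandas as pd


def drop_contained(df: pd.DataFrame, col: str = "A") -> pd.DataFrame:
    """Drop rows whose `col` value is a substring of another value in `col`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Sort by the actual string length (stable sort) instead of relying on B.
    # "_len" is a temporary column assumed not to clash with existing ones.
    ordered = df.assign(_len=df[col].str.len()).sort_values("_len", kind="mergesort")
    values = ordered[col].tolist()

    # A value survives only if no later value (same length or longer) contains it.
    keep = [
        not any(a in b for b in values[i + 1:])
        for i, a in enumerate(values)
    ]
    return ordered[keep].drop(columns="_len").sort_index()


df = pd.DataFrame({
    "A": ["exa", "example", "exam", "hello", "hell"],
    "B": [3, 6, 4, 4, 3],
})
result = drop_contained(df)
print(result)
```

On the sample data this keeps the same two rows (example and hello). With exact duplicates, only the last copy in sorted order is kept, because each earlier copy is contained in a later one.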

BENY · Answer

Like the other answer, my solution is O(m*n) as well (in your case m = n).

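import numpy as np

# Entry (i, j) is True when value i is a substring of value j; a value contained only in itself has a row sum of 1.
df[np.sum(np.array([[y in x for x in df.A.values] for y in df.A.values]),1)==1]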
Out[30]: 
         A  B
1  example  6
3    hello  4
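
To see why the row sum works, it can help to look at the containment matrix itself. A minimal sketch on the question's sample data; the variable names values, contained, and mask are illustrative, not from the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["exa", "example", "exam", "hello", "hell"],
    "B": [3, 6, 4, 4, 3],
})

values = df["A"].to_numpy()

# contained[i, j] is True when values[i] is a substring of values[j].
contained = np.array([[y in x for x in values] for y in values])

# Every string contains itself, so the diagonal is always True.
# A row sum of exactly 1 means the only match is the string itself.
mask = contained.sum(axis=1) == 1

print(contained.astype(int))
print(df[mask])
```

One consequence of this design: exact duplicates match each other as well as themselves, so with this filter both copies would be dropped rather than one being kept.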

Tags: python, pandas, duplicates, apply

Paola Mejia
