Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delete string if there is a longer string that starts with same pattern

Tags:

python

I have a list of values:


['27',
 '27.1',
 '27.1.a',
 '27.1.b',
 '27.3.d.28',
 '27.3.d.28.1',
 '27.3.d.28.2',
]

I want to keep only the longest values for each "parent" value. That is, delete 27 if there is a 27.something. Delete 27.1. if there is a 27.1.something, etc.

In this case, output should be

['27.1.a',
 '27.1.b',
 '27.3.d.28.1',
 '27.3.d.28.2',
]

Background info: this is a dataset in which 27 contains the sum of all 27.sth, so I only want to keep the most dissagregated data.

I know that I can find the longest string by

lis = ['27',
 '27.1',
 '27.1.a',
 '27.1.b',
 '27.3.d.28',
 '27.3.d.28.1',
 '27.3.d.28.2',
]

ls = max(len(x) for x in lis)       
[x for x in lis if len(x) == le] 

But I don't know how to do this tree search. I guess something with regex could work but I have no idea where to start. Any hint is appreciated.

I also saw this idea which works well if I had only one level (one period), but I don't know how to apply it to my case.

like image 701
Ignacio Avatar asked Sep 16 '25 22:09

Ignacio


2 Answers

Assuming the match is anchored on the left, a strategy might be to sort the strings, thus the strings will cluster by prefix, then only compare the successive pairs:

def is_parent(p, target):
    return p.startswith(target) and len(p)>len(target) and p[len(target)] == '.'

out = []
prev = ''
for s in sorted(lis)+['']:
    if prev and not is_parent(s, prev):
        out.append(prev)
    prev = s

print(out)

Output: ['27.1.a', '27.1.b', '27.3.d.28.1', '27.3.d.28.2']

like image 190
mozway Avatar answered Sep 18 '25 11:09

mozway


I don't think that tree search is required here. You only have to check whether a list value is indeed a longest value, i.e., it is not contained in another value. The following calculation should do what you want (which could be more efficient I guess).

values = ['27', '27.1', '27.1.a', '27.1.b', '27.3.d.28', '27.3.d.28.1', '27.3.d.28.2']

longest_values = []
for v1 in values:
    # check whether v1 is a longest value
    is_longest_value = True

    # this is the case when v1 is not "in" (string compare) any other value
    for v2 in values:
        if v1 in v2 and len(v1) < len(v2):
            # v1 is not a longest value
            is_longest_value = False
            break

    if is_longest_value:
        longest_values.append(v1)

Applying this on the given list results in longest_values = ['27.1.a', '27.1.b', '27.3.d.28.1', '27.3.d.28.2'].

like image 27
josch14 Avatar answered Sep 18 '25 10:09

josch14