I have a list of values:
['27',
'27.1',
'27.1.a',
'27.1.b',
'27.3.d.28',
'27.3.d.28.1',
'27.3.d.28.2',
]
I want to keep only the longest values for each "parent" value. That is, delete 27 if there is a 27.something. Delete 27.1. if there is a 27.1.something, etc.
In this case, output should be
['27.1.a',
'27.1.b',
'27.3.d.28.1',
'27.3.d.28.2',
]
Background info: this is a dataset in which 27 contains the sum of all 27.sth, so I only want to keep the most dissagregated data.
I know that I can find the longest string by
lis = ['27',
'27.1',
'27.1.a',
'27.1.b',
'27.3.d.28',
'27.3.d.28.1',
'27.3.d.28.2',
]
ls = max(len(x) for x in lis)
[x for x in lis if len(x) == le]
But I don't know how to do this tree search. I guess something with regex could work but I have no idea where to start. Any hint is appreciated.
I also saw this idea which works well if I had only one level (one period), but I don't know how to apply it to my case.
Assuming the match is anchored on the left, a strategy might be to sort the strings, thus the strings will cluster by prefix, then only compare the successive pairs:
def is_parent(p, target):
return p.startswith(target) and len(p)>len(target) and p[len(target)] == '.'
out = []
prev = ''
for s in sorted(lis)+['']:
if prev and not is_parent(s, prev):
out.append(prev)
prev = s
print(out)
Output: ['27.1.a', '27.1.b', '27.3.d.28.1', '27.3.d.28.2']
I don't think that tree search is required here. You only have to check whether a list value is indeed a longest value, i.e., it is not contained in another value. The following calculation should do what you want (which could be more efficient I guess).
values = ['27', '27.1', '27.1.a', '27.1.b', '27.3.d.28', '27.3.d.28.1', '27.3.d.28.2']
longest_values = []
for v1 in values:
# check whether v1 is a longest value
is_longest_value = True
# this is the case when v1 is not "in" (string compare) any other value
for v2 in values:
if v1 in v2 and len(v1) < len(v2):
# v1 is not a longest value
is_longest_value = False
break
if is_longest_value:
longest_values.append(v1)
Applying this on the given list results in longest_values = ['27.1.a', '27.1.b', '27.3.d.28.1', '27.3.d.28.2']
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With