I've got dictionaries such as:
'1' : ['GAA', 'GAAA', 'GAAAA', 'GAAAAA', 'GAAAAAG', 'GAAAAAGU', 'GAAAAAGUA', 'GAAAAAGUAU', 'GAAAAAGUAUG', 'GAAAAAGUAUGC', 'GAAAAAGUAUGCA', 'GAAAAAGUAUGCAA', 'GAAAAAGUAUGCAAG', 'GAAAAAGUAUGCAAGA', 'GAAAAAGUAUGCAAGAA', 'GAAAAAGUAUGCAAGAAC']
'2' : ['GAG', 'GAGA', 'GAGAG', 'GAGAGA', 'GAGAGAG', 'GAGAGAGA', 'GAGAGAGAC', 'GAGAGAGACA', 'GAGAGAGACAU', 'GAGAGAGACAUA', 'GAGAGAGACAUAG', 'GAGAGAGACAUAGA', 'GAGAGAGACAUAGAG', 'GAGAGAGACAUAGAGG']
'3' : ['GUC', 'GUCU', 'GUCUU', 'GUCUUU', 'GUCUUUG', 'GUCUUUGU', 'GUCUUUGU"', 'GUCUUUGU"G', 'GUCUUUGU"GU', 'GUCUUUGU"GUA', 'GUCUUUGU"GUAC', 'GUCUUUGU"GUACA', 'GUCUUUGU"GUACAU', 'GUCUUUGU"GUACAUC']
I am trying to make the program find the shortest string in each list (such as GAA in the first) and use it to find all other entries that are simply extensions of it (strings that start with GAA and just have extra letters), then remove them.
I know plenty of questions have been asked here about how to remove items from a list, but none of them help with this particular problem.
>>> dictionary = {'1': ['GAA', 'GAAA', 'GAAAA', 'GAAAAA', 'GAAAAAG', 'GAAAAAGU',
...     'GAAAAAGUA', 'GAAAAAGUAU', 'GAAAAAGUAUG', 'GAAAAAGUAUGC',
...     'GAAAAAGUAUGCA', 'GAAAAAGUAUGCAA', 'GAAAAAGUAUGCAAG',
...     'GAAAAAGUAUGCAAGA', 'GAAAAAGUAUGCAAGAA', 'GAAAAAGUAUGCAAGAAC',
...     'RTRSRS', 'GAG', 'GAGA', 'GAGAG', 'GAGAGA', 'GAGAGAG', 'GAGAGAGA',
...     'GAGAGAGAC', 'GAGAGAGACA', 'GAGAGAGACAU', 'GAGAGAGACAUA',
...     'GAGAGAGACAUAG', 'GAGAGAGACAUAGA', 'GAGAGAGACAUAGAG',
...     'GAGAGAGACAUAGAGG']}
>>> new_dict = {}
>>> for i in dictionary:
...     # length of the shortest string in this list
...     l = len(min(dictionary[i], key=len))
...     # every string of that minimal length is treated as a prefix
...     m = [x for x in dictionary[i] if len(x) == l]
...     temp = []
...     temp.extend(m)
...     for k in dictionary[i]:
...         # keep k only if it does not start with any of the prefixes
...         if not any(map(lambda j: k.startswith(j), m)):
...             temp.append(k)
...     new_dict[i] = temp
...
>>> print(new_dict)
# {'1': ['GAA', 'GAG', 'RTRSRS']}
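The `any(map(lambda ...))` test can also be written by passing a tuple to `str.startswith`, which returns True if the string starts with any of the given prefixes. Here is a minimal sketch of the same logic wrapped in a function; the name `drop_extensions` and the short `sample` list are my own, made up for illustration:

```python
def drop_extensions(strings):
    """Keep the shortest strings plus anything that does not extend one of them.

    drop_extensions is a hypothetical helper name, not part of the code above.
    """
    if not strings:
        return []
    shortest_len = len(min(strings, key=len))
    # All entries tied for the minimum length act as prefixes.
    prefixes = tuple(s for s in strings if len(s) == shortest_len)
    # str.startswith accepts a tuple and matches if any prefix matches.
    rest = [s for s in strings if not s.startswith(prefixes)]
    # As in the loop above, the shortest strings come first in the result.
    return list(prefixes) + rest


sample = ['GAA', 'GAAA', 'RTRSRS', 'GAG', 'GAGA', 'GAGAG']
print(drop_extensions(sample))  # ['GAA', 'GAG', 'RTRSRS']
```

Like the loop above, this puts the shortest strings at the front of the result rather than keeping the original order.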
Your sample data is not ideal: in each list, every other entry starts with the shortest string, so all of them would be removed. Here is a shorter version with one different entry per list:
data = {'1': ['GAA', 'xxxxxxx', 'GAAA', 'GAAAA', 'GAAAAA'],
        '2': ['GAG', 'yyyyyyyy', 'GAGA', 'GAGAG', 'GAGAGA'],
        '3': ['GUC', 'zzzzzz', 'GUCU', 'GUCUU', 'GUCUUU']}
Now:
res = {}
for key, value in data.items():
    shortest = min(value, key=len)
    res[key] = [entry for entry in value
                if not entry.startswith(shortest) or entry == shortest]
>>> res
{'1': ['GAA', 'xxxxxxx'], '2': ['GAG', 'yyyyyyyy'], '3': ['GUC', 'zzzzzz']}
Note: This also keeps the position of the shortest string relative to the others that remain. Just in case this matters.
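One thing to be aware of: `min(value, key=len)` returns only the first of several equally short strings, so if a list contains two shortest strings (say both GAA and GAG), only extensions of the first one are removed. If that matters for your data, a possible variant that still keeps the original order could look like the sketch below. The function name `filter_keep_order` and the `tied` sample are assumptions of mine, chosen just to show the tie case:

```python
def filter_keep_order(strings):
    """Drop extensions of every shortest string, preserving input order.

    filter_keep_order is a hypothetical helper name used for illustration.
    """
    if not strings:
        return []
    shortest_len = min(len(s) for s in strings)
    # Every string of minimal length is used as a prefix, not just the first.
    prefixes = tuple(s for s in strings if len(s) == shortest_len)
    return [s for s in strings
            if len(s) == shortest_len or not s.startswith(prefixes)]


# Made-up sample with two shortest strings, 'GAA' and 'GAG'.
tied = {'1': ['GAA', 'xxxxxxx', 'GAAA', 'GAG', 'GAGA']}
result = {key: filter_keep_order(value) for key, value in tied.items()}
print(result)  # {'1': ['GAA', 'xxxxxxx', 'GAG']}
```

With a single shortest string per list this should give the same result as the comprehension above.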