
Removing all extensions of a string in a list

I've got dictionaries such as:

'1' : ['GAA', 'GAAA', 'GAAAA', 'GAAAAA', 'GAAAAAG', 'GAAAAAGU', 'GAAAAAGUA', 'GAAAAAGUAU', 'GAAAAAGUAUG', 'GAAAAAGUAUGC', 'GAAAAAGUAUGCA', 'GAAAAAGUAUGCAA', 'GAAAAAGUAUGCAAG', 'GAAAAAGUAUGCAAGA', 'GAAAAAGUAUGCAAGAA', 'GAAAAAGUAUGCAAGAAC']

'2' : ['GAG', 'GAGA', 'GAGAG', 'GAGAGA', 'GAGAGAG', 'GAGAGAGA', 'GAGAGAGAC', 'GAGAGAGACA', 'GAGAGAGACAU', 'GAGAGAGACAUA', 'GAGAGAGACAUAG', 'GAGAGAGACAUAGA', 'GAGAGAGACAUAGAG', 'GAGAGAGACAUAGAGG']

'3' : ['GUC', 'GUCU', 'GUCUU', 'GUCUUU', 'GUCUUUG', 'GUCUUUGU', 'GUCUUUGU"', 'GUCUUUGU"G', 'GUCUUUGU"GU', 'GUCUUUGU"GUA', 'GUCUUUGU"GUAC', 'GUCUUUGU"GUACA', 'GUCUUUGU"GUACAU', 'GUCUUUGU"GUACAUC']

I am trying to make the program find the shortest string in each list (such as GAA in the first), use it to find all other entries that are simply extensions of it (strings that start with GAA and just have extra letters), and remove them.

I know there have been plenty of questions asked here about how to remove items from a list, but none of them help with this problem.

lamazibiji asked Dec 08 '15 05:12


2 Answers

dictionary = {'1': ['GAA', 'GAAA', 'GAAAA', 'GAAAAA', 'GAAAAAG', 'GAAAAAGU',
                    'GAAAAAGUA', 'GAAAAAGUAU', 'GAAAAAGUAUG', 'GAAAAAGUAUGC',
                    'GAAAAAGUAUGCA', 'GAAAAAGUAUGCAA', 'GAAAAAGUAUGCAAG',
                    'GAAAAAGUAUGCAAGA', 'GAAAAAGUAUGCAAGAA', 'GAAAAAGUAUGCAAGAAC',
                    'RTRSRS', 'GAG', 'GAGA', 'GAGAG', 'GAGAGA', 'GAGAGAG', 'GAGAGAGA',
                    'GAGAGAGAC', 'GAGAGAGACA', 'GAGAGAGACAU', 'GAGAGAGACAUA',
                    'GAGAGAGACAUAG', 'GAGAGAGACAUAGA', 'GAGAGAGACAUAGAG',
                    'GAGAGAGACAUAGAGG']}
new_dict = {}

for i in dictionary:
    # Length of the shortest string in this list.
    l = len(min(dictionary[i], key=len))
    # Every string of that minimal length is treated as a prefix to keep.
    m = [x for x in dictionary[i] if len(x) == l]
    temp = []
    temp.extend(m)
    # Keep the remaining strings that do not start with any of those prefixes.
    for k in dictionary[i]:
        if not any(map(lambda j: k.startswith(j), m)):
            temp.append(k)
    new_dict[i] = temp

print(new_dict)
# {'1': ['GAA', 'GAG', 'RTRSRS']}
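
The loop above can also be written with `str.startswith`, which accepts a tuple of prefixes. What follows is only a minimal sketch, not part of the original answer: the function name `drop_extensions` is hypothetical, and it assumes every list is non-empty. Unlike the loop above, it leaves the remaining entries in their original order rather than moving the shortest strings to the front.

```python
def drop_extensions(strings):
    """Keep the shortest strings; drop longer strings that start with one of them."""
    # Assumption: `strings` is non-empty (min() raises ValueError otherwise).
    shortest_len = len(min(strings, key=len))
    # All strings tied for the minimal length serve as prefixes.
    prefixes = tuple(s for s in strings if len(s) == shortest_len)
    return [s for s in strings
            if len(s) == shortest_len or not s.startswith(prefixes)]


sample = {'1': ['GAA', 'GAAA', 'RTRSRS', 'GAG', 'GAGA']}
result = {key: drop_extensions(value) for key, value in sample.items()}
print(result)
# {'1': ['GAA', 'RTRSRS', 'GAG']}
```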
Ayush answered Sep 24 '22 12:09


Your sample data is not ideal: every other entry starts with the shortest string, so all of them would be removed. Here is a shorter version with one extra, non-matching entry per list:

data = {'1' : ['GAA', 'xxxxxxx', 'GAAA', 'GAAAA', 'GAAAAA'],
        '2' : ['GAG', 'yyyyyyyy', 'GAGA', 'GAGAG', 'GAGAGA'],
        '3' : ['GUC', 'zzzzzz', 'GUCU', 'GUCUU', 'GUCUUU']}

Now:

res = {}
for key, value in data.items():
    shortest = min(value, key=len)
    res[key] = [entry for entry in value if not entry.startswith(shortest) 
                or entry == shortest]

>>> res
{'1': ['GAA', 'xxxxxxx'], '2': ['GAG', 'yyyyyyyy'], '3': ['GUC', 'zzzzzz']}

Note: This also keeps the position of the shortest string relative to the other entries that remain, in case that matters.
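
To illustrate the note on ordering, here is a small sketch that wraps the comprehension above in a function (the name `remove_extensions` is hypothetical) and applies it to a list whose shortest string is not the first element. It assumes a non-empty list. If several strings tie for the shortest length, `min` returns only the first of them, so extensions of the others would remain.

```python
def remove_extensions(entries):
    """Drop every entry that extends the shortest string; keep the shortest itself."""
    # Assumption: `entries` is non-empty (min() raises ValueError otherwise).
    shortest = min(entries, key=len)
    return [entry for entry in entries
            if not entry.startswith(shortest) or entry == shortest]


# The shortest string 'GAA' sits in the middle of the list.
entries = ['xxxxxxx', 'GAAA', 'GAA', 'GAAAA']
print(remove_extensions(entries))
# ['xxxxxxx', 'GAA']
```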

Mike Müller answered Sep 22 '22 12:09