Remove item from list based on the next item in same list

I just started learning Python, and I have a sorted list of protein sequences (59,000 in total), some of which overlap. Here is a toy list as an example:

ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH

I would like to remove the shorter overlapping sequences and keep only the longest one, so the desired output would look like this:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
FEOEUDNBNUWD
FGH

How can I do it? My code looks like this:

with open('toy.txt' ,'r') as f:
    pattern = f.read().splitlines()
    print pattern

    for i in range(0, len(pattern)):
        if pattern[i] in pattern[i+1]:
            pattern.remove(pattern[i])
        print pattern

And I got the error message:

['ABCDE', 'ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGH', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDE', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FG', 'FGH']
['ABCDEFG', 'ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    if pattern[i] in pattern[i+1]:
IndexError: list index out of range

asked Jul 13 '18 14:07

Kenny

2 Answers

There are other working answers, but none of them explains your actual problem. You were really close to a valid solution, and to what is, in my opinion, the most readable answer.

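The error comes from mutating the list while indexing into it with range(). range(len(pattern)) is evaluated once, before the loop starts, so it keeps producing indices for the original length of the list.

Thus, while i keeps increasing, remove() keeps shortening the list, and at some point pattern[i+1] no longer exists, which raises the IndexError. Even without any removal, pattern[i+1] would go out of range on the last iteration.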
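A minimal sketch of that failure mode, using a hypothetical three-item list rather than the original data. The names items, visited and error are only for illustration:

```python
# Hypothetical short list, used only to illustrate the failure mode.
items = ["AB", "ABC", "XY"]

indices = range(len(items))  # fixed at range(0, 3) before the loop starts
visited = []
error = None

try:
    for i in indices:
        visited.append(i)
        if items[i] in items[i + 1]:
            items.remove(items[i])  # the list shrinks, the range does not
except IndexError as exc:
    error = exc

print(visited)      # indices reached before the failure
print(items)        # the list after the in-place removal
print(repr(error))  # the IndexError raised by items[i + 1]
```

Here the loop fails on its second pass: the list has shrunk to two items, yet the loop still asks for items[2].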

Here is a working version of your initial code with a few changes:

pattern = ["ABCDE", "ABCDEFG", "ABCDEFGH", "ABCDEFGHIJKLMNO", "CEST", "DBTSFDE", "DBTSFDEO", "EOEUDNBNUW", "EAEUDNBNUW", "FG", "FGH"]
output_pattern = []

# Stop one item early so that pattern[i+1] always exists
for i in range(len(pattern) - 1):
    # Keep the current item only if the next item does not contain it
    if pattern[i] not in pattern[i+1]:
        output_pattern.append(pattern[i])

# Adding the last item
output_pattern.append(pattern[-1])
print(output_pattern)

# Output: ['ABCDEFGHIJKLMNO', 'CEST', 'DBTSFDEO', 'EOEUDNBNUW', 'EAEUDNBNUW', 'FGH']

Note that this code only works if your list is already sorted, as you mentioned in the comments.

What is this code doing?

Basically, it uses the same logic as your initial attempt: it iterates over the list and checks whether the next item contains the current item. Building a separate output list and stopping at the second-to-last item fixes your index problem. But that raises a question:

What should I do with the last item?

Since the list is sorted, nothing after the last item can contain it, so it is always kept. This is why I'm using

output_pattern.append(pattern[-1])

which adds the last item of the initial list.
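As a sketch of the same idea without explicit indices, each item can be paired with its successor using zip(). The function name drop_contained_in_next is hypothetical, and the sketch still assumes a sorted list:

```python
def drop_contained_in_next(sequences):
    """Keep each item unless the item right after it contains it."""
    if not sequences:
        return []
    # zip() pairs every item with its successor and stops before the last item.
    kept = [cur for cur, nxt in zip(sequences, sequences[1:]) if cur not in nxt]
    # The last item has no successor, so it is always kept.
    kept.append(sequences[-1])
    return kept


pattern = ["ABCDE", "ABCDEFG", "ABCDEFGH", "ABCDEFGHIJKLMNO", "CEST",
           "DBTSFDE", "DBTSFDEO", "EOEUDNBNUW", "EAEUDNBNUW", "FG", "FGH"]
result = drop_contained_in_next(pattern)
print(result)
```

Because zip() stops at the shorter of its two inputs, there is no index that can run past the end of the list.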

Important note

This answer was written in response to OP's initial question, where he wanted to keep the longer overlap and, I quote, "based on the next item in same list". As stated by @Chris_Rands, if your concern is a biological task and you need to find any overlap, this solution is not suited to your needs.

Here is an example where this code would fail to recognize a potential overlap:

pattern = ["ACD", "AD", "BACD"]

It would output the list unchanged, without removing "ACD", even though "ACD" is contained in "BACD", because the two are not adjacent. Just as a clarification, handling this case would require a much more complex algorithm, and I initially thought it was out of the scope of the question's requirements. If this is your case, I may be completely wrong here, but I think a C++ implementation would be more appropriate. Have a look at the CD-Hit algorithm suggested by @Chris_Rands in the comment section.
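For small inputs, a naive sketch of the "any overlap" case could look like the following. It is not the CD-Hit algorithm, the function name drop_contained_anywhere is hypothetical, and it performs a quadratic number of substring checks in the worst case, so it may be too slow for 59,000 sequences:

```python
def drop_contained_anywhere(sequences):
    """Remove every sequence that is a substring of another sequence."""
    kept = []
    # Visit the longest sequences first, so that any possible container
    # of the current sequence has already been examined.
    for seq in sorted(set(sequences), key=len, reverse=True):
        if not any(seq in other for other in kept):
            kept.append(seq)
    return sorted(kept)


result = drop_contained_anywhere(["ACD", "AD", "BACD"])
print(result)  # ['AD', 'BACD']
```

"ACD" is dropped because "BACD" contains it. "AD" stays because it does not occur as a contiguous substring of either of the other sequences.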

answered Oct 18 '22 22:10

scharette

You could use groupby() and max() to help here:

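from itertools import groupby

with open('toy.txt') as f_input:
    # Strip newlines first so that line endings do not affect the lengths
    lines = (line.strip() for line in f_input)
    # Group consecutive lines that share the same first 2 characters
    for key, group in groupby(lines, key=lambda x: x[:2]):
        # Print the longest line of each group
        print(max(group, key=len))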

This would display:

ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EOEUDNBNUW
EAEUDNBNUW
FGH

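groupby() yields one group per run of consecutive items that share the same key, in this case consecutive lines with the same first 2 characters. The max() function then takes each group and returns the item with the longest length. Note that this groups by prefix rather than testing containment, so two unrelated sequences that happen to share their first 2 characters would be treated as one group.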
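A small sketch of how groupby() splits a list, using a hypothetical in-memory list instead of toy.txt. It shows that only consecutive items with equal keys end up in the same group:

```python
from itertools import groupby

# Hypothetical data; "ABX" is deliberately placed away from the other "AB" items.
lines = ["ABCDE", "ABCDEFG", "CEST", "FG", "FGH", "ABX"]

# Materialize each group as a list so it can be printed and reused.
groups = [(key, list(group)) for key, group in groupby(lines, key=lambda s: s[:2])]
print(groups)
# [('AB', ['ABCDE', 'ABCDEFG']), ('CE', ['CEST']), ('FG', ['FG', 'FGH']), ('AB', ['ABX'])]

longest = [max(members, key=len) for _, members in groups]
print(longest)
# ['ABCDEFG', 'CEST', 'FGH', 'ABX']
```

The key 'AB' appears twice because "ABX" is not adjacent to the other "AB" items, which is why this approach relies on the input being sorted.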

answered Oct 18 '22 21:10

Martin Evans
