
Replace a list of characters with indices in a string in python

Tags:

python

I have a list of coordinates:

coordinates = [[1,5], [10,15], [25, 35]]

I have a string as follows:

line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'

I want to replace each interval in coordinates, given as a [start, end] pair, with the character 'N'.

The only way I can think of is the following:

for element in coordinates:
    length = element[1] - element[0]
    line = line.replace(line[element[0]:element[1]], 'N'*length)

The desired output would be:

line = 'ANNNNGTGTGNNNNNACGTACGTGTNNNNNNNNNNGTGKWSGTGAAAAAKCT'

where the intervals [1,5), [10,15) and [25,35) are replaced with N in line.

This requires me to loop through the coordinate list and update my string line, every time. I was wondering if there is another way that one can replace a list of intervals in a string?

Note: The original solution in this question has a problem. In line.replace(line[element[0]:element[1]], 'N'*length), replace substitutes every occurrence of the substring line[element[0]:element[1]] anywhere in the sequence, not just the one at that position, and for people working with DNA this is definitely not what you want! I am nevertheless keeping the solution as it is, so as not to disturb the flow of the comments and discussion that follow.
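
To make the pitfall in the note concrete, here is a small sketch on a made-up six-character sequence (seq, by_replace and by_slice are illustrative names, not from the question). It compares the replace call with a purely positional version built from slices:

```python
# Hypothetical short sequence chosen so that the motif 'AC' occurs twice.
seq = "ACGTAC"
start, end = 0, 2  # only this interval should be masked

# str.replace matches by content, so every 'AC' is replaced, not just seq[0:2].
by_replace = seq.replace(seq[start:end], "N" * (end - start))

# Slicing works by position and touches only the requested interval.
by_slice = seq[:start] + "N" * (end - start) + seq[end:]

print(by_replace)  # NNGTNN
print(by_slice)    # NNGTAC
```

The second 'AC' at the end of the sequence is masked by replace even though it lies outside the interval, which is exactly the behaviour described above.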

Homap asked Jul 30 '20 09:07



1 Answer

Instead of string concatenation (which is wasteful because every step creates and discards intermediate string instances), use a list:

coordinates = [[1,5], [10,15], [25, 35]] # sorted

line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'

result = list(line)
# opted for exclusive end pos
for start, end in coordinates:
    for p in range(start, end):
        result[p] = 'N'

res = ''.join(result)
print(res)

To get:

ANNNNGTGTGNNNNNACGTACGTGTNNNNNNNNNNGTGKWSGTGAAAAAKCT

Optimized to use slice assignment (the end position is still exclusive):

for start,end in coordinates:
    result[start:end] = ["N"]*(end-start)

res = ''.join(result)
print(line)
print(res)

This gives the desired output:

ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT 
ANNNNGTGTGNNNNNACGTACGTGTNNNNNNNNNNGTGKWSGTGAAAAAKCT
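
If this is needed in more than one place, the slice-assignment loop can be wrapped in a small helper. The following is only a sketch: the function name mask_intervals and its mask parameter are made up for illustration. It also clamps each interval to the bounds of the string, which is an added assumption about what you want, because a slice assignment whose end runs past the list would otherwise make the list (and the result) longer than the input.

```python
def mask_intervals(seq, intervals, mask="N"):
    """Return a copy of seq with every half-open [start, end) interval masked.

    Hypothetical helper: intervals may be unsorted or overlapping.
    """
    chars = list(seq)
    n = len(chars)
    for start, end in intervals:
        # Clamp to the sequence bounds so the result keeps the original length.
        start = max(start, 0)
        end = min(end, n)
        if start < end:
            chars[start:end] = [mask] * (end - start)
    return "".join(chars)


coordinates = [[1, 5], [10, 15], [25, 35]]
line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'
print(mask_intervals(line, coordinates))
```

Because every character is addressed by position, repeated motifs elsewhere in the sequence are left untouched.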
Patrick Artner answered Sep 28 '22 09:09
