I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. The code that I have works, but it is taking about an hour to process the first 70K lines. In trying to incrementally advance my Python skills, I am wondering whether there is a faster way to do this. Thanks!
The input file looks something like this:
CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518
and the file with replacement values looks like this:
WBGene00045518 21ur-5153
Here is my code:
infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

# Build the lookup table: key -> replacement text
udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]: linelist[1]}
    udict.update(udict1)

# Line counts at which progress is reported
mult10K = []
for x in range(100):
    mult10K.append(x * 10000)

linecounter = 0
for line in infile2:
    # Every line is tested against every key in the dictionary
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0:
            print(key, value)
            line = line.replace(key, value)
    # Write the line once all keys have been tried; it keeps its own newline
    outfile.write(line)
    linecounter += 1
    if linecounter in mult10K:
        print(linecounter)
        print(datetime.now() - startTime)

infile1.close()
infile2.close()
outfile.close()
Reading large text files in Python: a file object can be used as an iterator that returns the lines one by one, so each line can be processed as it arrives. The whole file is never read into memory, which makes this suitable for very large files.
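As a minimal sketch of that idea, the snippet below writes a small temporary file and then loops over the file object directly. The helper name `count_lines` and the sample contents are assumptions made for this illustration; they are not part of the question.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file object, one line at a time."""
    total = 0
    with open(path, "r") as handle:
        # The file object yields one line per iteration, so the file is
        # never loaded into memory as a whole.
        for _line in handle:
            total += 1
    return total


# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    sample_path = tmp.name

line_total = count_lines(sample_path)
os.remove(sample_path)
print(line_total)
```

Both the question's code and the answer's code already read `f2.txt` this way, so memory use is not the bottleneck here.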
Python's str.replace() returns a copy of the string in which all occurrences of a substring are replaced with another substring.
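A short illustration of that behaviour, using the gene ID and replacement from the question as sample data (the variable names are made up for this example):

```python
line = "ID=Gene:WBGene00045518 Parent=Gene:WBGene00045518"

# replace() returns a new string; the original string is left unchanged.
renamed = line.replace("WBGene00045518", "21ur-5153")

# count() reports the number of non-overlapping occurrences.
occurrences = line.count("WBGene00045518")

# When the substring is absent, replace() returns an equal string.
untouched = line.replace("no-such-key", "whatever")

print(renamed)
print(occurrences)
```

Because `replace()` leaves the string as it is when nothing matches, checking `count()` first is not required for correctness.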
You should split your lines into "words" and only look up these words in your dictionary:
>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']
This will eliminate the loop over the dictionary you do for every single line.
Here's the complete code:
import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)
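One thing to keep in mind with the loop above: `line.replace(word, ...)` replaces every occurrence of the word in the line, including occurrences inside longer tokens. If that could matter for your data, a possible variant is to substitute token by token with `re.sub` and a callback. This is only a sketch; the mapping and the helper name `replace_tokens` are made up for the illustration.

```python
import re

# Hypothetical mapping, used only for this illustration.
udict = {"abc": "X"}

WORD = re.compile(r"\w+")


def replace_tokens(line, mapping):
    """Swap each whole word token that is a key in mapping; keep the rest."""
    return WORD.sub(lambda m: mapping.get(m.group(), m.group()), line)


line = "ID=abc Parent=abcd\n"

# Token-wise substitution only touches the token "abc".
whole_tokens = replace_tokens(line, udict)

# Plain substring replacement also changes the start of "abcd".
substring_based = line.replace("abc", udict["abc"])

print(repr(whole_tokens))
print(repr(substring_based))
```

If your IDs all have the same fixed shape, as in the sample line from the question, the two approaches should give the same result.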
Edit: An alternative approach is to build a single mega-regex from your dictionary:
import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

# One pattern that matches any key; each key is escaped so it is taken literally
regex = re.compile("|".join(map(re.escape, udict)))

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        outfile.write(regex.sub(lambda m: udict[m.group()], line))
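A caveat for the mega-regex: Python tries the alternatives of a pattern from left to right and takes the first one that matches, so if one key is a prefix of another, the shorter key can win. Assuming that situation can occur in your data, one way around it is to put the longest keys first. The mapping and the helper name `build_regex` below are hypothetical and only serve the illustration.

```python
import re

# Hypothetical mapping in which one key is a prefix of another.
udict = {"gene1": "A", "gene10": "B"}
text = "x=gene10 y=gene1"


def build_regex(mapping):
    """Compile one alternation of all keys, longest keys first."""
    keys = sorted(mapping, key=len, reverse=True)
    return re.compile("|".join(map(re.escape, keys)))


def substitute(pattern, mapping, line):
    """Replace every match with its value from mapping."""
    return pattern.sub(lambda m: mapping[m.group()], line)


# Shorter key listed first: "gene1" matches inside "gene10".
naive_result = substitute(re.compile("gene1|gene10"), udict, text)

# Longest key first: "gene10" is matched as a whole.
result = substitute(build_regex(udict), udict, text)

print(naive_result)
print(result)
```

If no key is a prefix of another key, the ordering makes no difference and the original one-liner is fine.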