 

python beginner - faster way to find and replace in large file?

Tags:

python

replace

I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. The code I have works, but it has taken about an hour to process the first 70K lines. In trying to incrementally advance my Python skills, I am wondering whether there is a faster way to do this. Thanks! The input file looks something like this:

CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518

and the file with replacement values looks like this:

WBGene00045518 21ur-5153

Here is my code:

infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]} 
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)   
linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0: 
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter   
        print (datetime.now()-startTime)
infile1.close()
infile2.close()
outfile.close()
pandaSeq asked Apr 20 '12


1 Answer

You should split your lines into "words" and only look up these words in your dictionary:

>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']

This will eliminate the loop over the dictionary you do for every single line.
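To see why this matters, here is a small micro-benchmark sketch (the 50,000-entry dictionary and its contents are invented for illustration; absolute timings will vary by machine):

import re
import timeit

# Hypothetical data: a 50,000-entry replacement dictionary and one line from the
# question's input file. The dictionary contents are made up for illustration.
udict = {"WBGene%08d" % i: "21ur-%d" % i for i in range(50000)}
line = "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518"

# Original approach: substring-search the line for every key in the dictionary.
scan_keys = timeit.timeit(lambda: any(line.count(k) for k in udict), number=10)

# Suggested approach: split the line into words and look each word up directly.
lookup_words = timeit.timeit(
    lambda: [w for w in re.findall(r"\w+", line) if w in udict], number=10)

print("scan all keys: %.4f s" % scan_keys)
print("look up words: %.4f s" % lookup_words)

The dictionary lookup is a constant-time hash check per word, whereas the original inner loop does a substring search for every key on every line.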

Here's the complete code:

import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        outfile.write(line)
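
For illustration, here is a minimal sketch of what the per-word replacement does, using the example data from the question (one dictionary entry, one input line) instead of the real files:

import re

# One replacement pair and one line taken from the question's examples.
udict = {"WBGene00045518": "21ur-5153"}
line = "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518"

for word in re.findall(r"\w+", line):
    if word in udict:
        line = line.replace(word, udict[word])

print(line)
# CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:21ur-5153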

Edit: An alternative approach is to build a single mega-regex from your dictionary:

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)
regex = re.compile("|".join(map(re.escape, udict)))
with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        outfile.write(regex.sub(lambda m: udict[m.group()], line))
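
A quick sketch of the combined regex on the same example data (pattern built exactly as above):

import re

udict = {"WBGene00045518": "21ur-5153"}
regex = re.compile("|".join(map(re.escape, udict)))

line = "CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518"
print(regex.sub(lambda m: udict[m.group()], line))
# CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:21ur-5153

One general caveat: Python's re module tries alternatives left to right, so if one key is a prefix of another, put the longer key first in the pattern (for example, sort the keys by length, longest first).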
Sven Marnach answered Oct 21 '22