python beginner - faster way to find and replace in large file?

Tags: python, replace

I have a file of about 100 million lines in which I want to replace text with alternate text stored in a tab-delimited file. The code that I have works, but it takes about an hour to process the first 70K lines. Since I am trying to incrementally advance my Python skills, I am wondering whether there is a faster way to do this. Thanks! The input file looks something like this:

CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518
CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518

and the file with replacement values looks like this:

WBGene00045518 21ur-5153

Here is my code:

infile1 = open('f1.txt', 'r')
infile2 = open('f2.txt', 'r')
outfile = open('out.txt', 'w')

import re
from datetime import datetime
startTime = datetime.now()

udict = {}
for line in infile1:
    line = line.strip()
    linelist = line.split('\t')
    udict1 = {linelist[0]:linelist[1]} 
    udict.update(udict1)

mult10K = []
for x in range(100):
    mult10K.append(x * 10000)   
linecounter = 0
for line in infile2:
    for key, value in udict.items():
        matches = line.count(key)
        if matches > 0: 
            print key, value
            line = line.replace(key, value)
            outfile.write(line + '\n')
        else:
            outfile.write(line + '\n')
    linecounter += 1
    if linecounter in mult10K:
        print linecounter   
        print (datetime.now()-startTime)
infile1.close()
infile2.close()
outfile.close()


asked Apr 20 '12 16:04

pandaSeq

1 Answer

You should split your lines into "words" and only look up these words in your dictionary:

>>> re.findall(r"\w+", "CHROMOSOME_IV ncRNA gene 5723085 5723105 . - . ID=Gene:WBGene00045518 CHROMOSOME_IV ncRNA ncRNA 5723085 5723105 . - . Parent=Gene:WBGene00045518")
['CHROMOSOME_IV', 'ncRNA', 'gene', '5723085', '5723105', 'ID', 'Gene', 'WBGene00045518', 'CHROMOSOME_IV', 'ncRNA', 'ncRNA', '5723085', '5723105', 'Parent', 'Gene', 'WBGene00045518']

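This eliminates the loop over the whole dictionary that you currently run for every single line: each line then costs only a handful of dictionary lookups, one per word, instead of one substring search per dictionary entry.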

Here's the complete code:

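import re

# Load the replacement table: one "ID<TAB>new name" pair per line.
with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)

with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        # Look up only the words that actually occur in this line.
        for word in re.findall(r"\w+", line):
            if word in udict:
                line = line.replace(word, udict[word])
        # The line still ends with its own newline, so write it as is.
        outfile.write(line)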
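One caveat with line.replace() is that it replaces every occurrence of the substring, including occurrences inside a longer identifier. The following is a minimal sketch of a variant that substitutes token by token in a single pass, assuming the IDs consist only of \w characters. The function name replace_words and the sample mapping are made up for illustration and are not part of the code above.

```python
import re

WORD = re.compile(r"\w+")

def replace_words(line, mapping):
    """Replace each whole word that is a key in mapping; leave everything else untouched."""
    # The callback returns the replacement if the token is a key,
    # otherwise the token itself, so unknown words pass through unchanged.
    return WORD.sub(lambda m: mapping.get(m.group(), m.group()), line)

# Hypothetical one-entry mapping, taken from the sample data in the question.
mapping = {"WBGene00045518": "21ur-5153"}
line = "CHROMOSOME_IV\tncRNA\tgene\t5723085\t5723105\t.\t-\t.\tID=Gene:WBGene00045518\n"
print(replace_words(line, mapping), end="")
```

Because the substitution works on whole tokens, a key that merely appears inside a longer word is never touched, and each line is scanned only once.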

Edit: An alternative approach is to build a single mega-regex from your dictionary:

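import re

with open("f1.txt", "r") as infile1:
    udict = dict(line.strip().split("\t", 1) for line in infile1)
# One pattern that matches any key; re.escape guards against regex metacharacters.
regex = re.compile("|".join(map(re.escape, udict)))
with open("f2.txt", "r") as infile2, open("out.txt", "w") as outfile:
    for line in infile2:
        # Replace every match with its dictionary value in a single pass.
        outfile.write(regex.sub(lambda m: udict[m.group()], line))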
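If one key can be a prefix of another, the order of the alternatives matters, because Python's re module tries them left to right and takes the first one that matches. A small sketch of one way to guard against that is to sort the keys longest first. The helper name build_regex, the key "WBGene000455" and the value "short-name" are hypothetical and only serve to show the effect.

```python
import re

def build_regex(mapping):
    """Compile one alternation over all keys, longest keys first."""
    # Longest first, so a short key cannot shadow a longer key that starts with it.
    keys = sorted(mapping, key=len, reverse=True)
    return re.compile("|".join(map(re.escape, keys)))

# Hypothetical mapping in which the first key is a prefix of the second.
mapping = {"WBGene000455": "short-name", "WBGene00045518": "21ur-5153"}
regex = build_regex(mapping)
print(regex.sub(lambda m: mapping[m.group()], "Parent=Gene:WBGene00045518"))
```

Without the sort, the pattern would try "WBGene000455" first and leave a stray "18" behind the replacement.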


answered Oct 21 '22 05:10

Sven Marnach
