
Python string replace in a file without touching the file if no substitution was made

What does Python's string.replace return if no string substitution was made? Does Python's file.open(f, 'w') always touch the file even if no changes were made?

Using Python, I'm trying to replace occurrences of 'oldtext' with 'newtext' in a set of files. If a file contains 'oldtext', I want to do the replacement and save the file. Otherwise, do nothing, so the file maintains its old timestamp.

The following code works fine, except all files get written, even if no string substitution was made, and all files have a new timestamp.

for match in all_files('*.html', '.'):  # all_files returns all html files in current directory     
  thefile = open(match)
  content = thefile.read()              # read entire file into memory
  thefile.close()
  thefile = open(match, 'w')             
  thefile.write(content.replace(oldtext, newtext))  # write the file with the text substitution
  thefile.close()

In this code I'm trying to do the file.write only if a string substitution occurred, but still, all the files get a new timestamp:

count = 0
for match in all_files('*.html', '.'):       # all_files returns all html files in current directory
    thefile = open(match)
    content = thefile.read()                 # read entire file into memory
    thefile.close()
    thefile = open(match, 'w')
    replacedText = content.replace(oldtext, newtext) 
    if replacedText != '':
        count += 1
        thefile.write(replacedText)
    thefile.close()
print (count)        # print the number of files that we modified

At the end, count is the total number of files, not the number of files modified. Any suggestions? Thanks.

I'm using Python 3.1.2 on Windows.

LandedGently asked Mar 12 '11 22:03


3 Answers

What does Python's string.replace return if no string substitution was made?

It returns the original string.
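
As a quick check of this, here is a minimal sketch. The sample strings are made up for illustration; only the comparison by value matters.

```python
content = "<p>hello world</p>"

# No occurrence of the search string: the result compares equal to the input.
unchanged = content.replace("oldtext", "newtext")
no_change = (unchanged == content)

# At least one occurrence: the result differs from the input.
changed = content.replace("hello", "howdy")
has_change = (changed != content)

print(no_change, has_change)
```

Comparing the result with the original text is therefore enough to tell whether a substitution happened.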

Does Python's file.open(f, 'w') always touch the file even if no changes were made?

More than merely touching the file, it destroys any content f used to contain.
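
The effect can be observed with a throwaway file. This is only a sketch: the temporary file and the arbitrary old timestamp are assumptions made for the demonstration.

```python
import os
import tempfile

# Create a throwaway file with some content and an old timestamp.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write("some content")
old_time = 1000000000  # arbitrary moment in the past, in seconds since the epoch
os.utime(path, (old_time, old_time))

# Open for writing and close again without writing anything.
with open(path, "w"):
    pass

size_after = os.path.getsize(path)                 # the old content is gone
mtime_changed = os.stat(path).st_mtime > old_time  # the timestamp moved forward
print(size_after, mtime_changed)

os.remove(path)
```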

So, you can test if the file needs to be rewritten with if replacedText != content, and only open the file in write mode if this is the case:

count = 0
for match in all_files('*.html', '.'):       # all_files returns all html files in current directory
    with open(match) as thefile:
        content = thefile.read()                 # read entire file into memory
        replacedText = content.replace(oldtext, newtext)
    if replacedText!=content:
        with open(match, 'w') as thefile:
            count += 1
            thefile.write(replacedText)
print (count)        # print the number of files that we modified
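
The same logic can be wrapped in a function that is easy to try out on its own. This is a sketch under assumptions: glob.glob stands in for the all_files helper from the question (whose definition is not shown), and replace_in_files is a hypothetical name.

```python
import glob

def replace_in_files(pattern, oldtext, newtext):
    """Rewrite only the files matching `pattern` that contain `oldtext`.

    Returns the number of files that were actually rewritten.
    """
    count = 0
    for path in glob.glob(pattern):
        with open(path) as f:
            content = f.read()               # read entire file into memory
        replaced = content.replace(oldtext, newtext)
        if replaced != content:              # open for writing only when needed
            with open(path, "w") as f:
                f.write(replaced)
            count += 1
    return count
```

Calling replace_in_files('*.html', 'oldtext', 'newtext') should leave every file without a match unopened for writing, so its timestamp is preserved.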

unutbu answered Oct 06 '22 02:10


What does Python's string.replace return if no string substitution was made?

If nothing is replaced, str.replace() returns the string itself, or a copy if the object is an instance of a subclass of str.

Does Python's file.open(f, 'w') always touch the file even if no changes were made?

open(f, 'w') opens and truncates the file f.

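Note that the code below is CPython-specific: it relies on str.replace() returning the very same object when nothing is replaced (the "is not" test) and on reference counting to close the files promptly. It won't work correctly on PyPy or Jython: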

count = 0
for match in all_files('*.html', '.'):
    content = open(match).read()
    replacedText = content.replace(oldtext, newtext) 
    if replacedText is not content:
        count += 1
        open(match, 'w').write(replacedText)
print (count)   
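
A portable way to get the same effect without relying on object identity is to test for the substring before replacing. The following is a minimal sketch; replace_if_present is a hypothetical helper name, not something from the answer above.

```python
def replace_if_present(content, oldtext, newtext):
    """Return (changed, text); `changed` is True only if `oldtext` occurs in `content`."""
    if oldtext in content:
        return True, content.replace(oldtext, newtext)
    return False, content

print(replace_if_present("a oldtext b", "oldtext", "newtext"))
print(replace_if_present("nothing to do", "oldtext", "newtext"))
```

This behaves the same on any Python implementation, at the cost of scanning the text twice when a match exists. Note that it reports a change even when oldtext and newtext happen to be identical.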

jfs answered Oct 06 '22 00:10


Your case is a special one: 'newtext' has exactly the same number of characters as 'oldtext'.

Hence, it is possible to use one of the following snippets to overwrite 'oldtext' with 'newtext' in place, either occurrence by occurrence or line by line, without rewriting the rest of the file.
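
Overwriting in place is only safe when the replacement occupies the same number of bytes on disk, which for non-ASCII text is not the same thing as the same number of characters. A small check, assuming the files are UTF-8 encoded (an assumption, since the question does not state the encoding); same_encoded_length is a hypothetical helper name:

```python
def same_encoded_length(oldtext, newtext, encoding="utf-8"):
    """True if both strings occupy the same number of bytes in `encoding`."""
    return len(oldtext.encode(encoding)) == len(newtext.encode(encoding))

print(same_encoded_length("oldtext", "newtext"))  # both are 7 ASCII bytes
print(same_encoded_length("cafe", "café"))        # 4 characters each, but 4 vs 5 bytes
```

If the check fails, the in-place approach would overwrite neighbouring bytes, and rewriting the whole file is the safer route.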

.

If the files are not too big, the content of each file can be read entirely into memory:

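from os import fsync      # code using find()

count = 0
for match in all_files('*.html', '.'):
    with open(match,'rb+') as thefile:      # binary mode, so search and write bytes
        diag = False                        # becomes True once something is overwritten
        fno = thefile.fileno()
        content = thefile.read()
        thefile.seek(0,0)
        x = content.find(b'oldtext')        # offset of the first occurrence
        while x>=0:
            diag = True
            thefile.seek(x,1)               # move forward to the occurrence
            thefile.write(b'newtext')       # overwrite it in place (same length)
            thefile.flush()
            fsync(fno)
            # offset of the next occurrence, relative to the current position
            x = content[thefile.tell():].find(b'oldtext')
    if diag:
        count += 1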

or

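from os import fsync     # code using a regex
import re
pat = re.compile(b'oldtext')                # bytes pattern, because the file is read as bytes

count = 0
for match in all_files('*.html', '.'):
    with open(match,'rb+') as thefile:
        diag = False                        # becomes True once something is overwritten
        fno = thefile.fileno()
        content = thefile.read()
        thefile.seek(0,0)
        prec = 0                            # end of the previous match = current position
        for mat in pat.finditer(content):
            diag = True
            thefile.seek(mat.start()-prec,1)    # move forward to the match
            thefile.write(b'newtext')           # overwrite it in place (same length)
            thefile.flush()
            fsync(fno)
            prec = mat.end()
    if diag:
        count += 1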

.

For large files, reading and rewriting line by line is possible:

from os import fsync   # code for big files, using regex
import re
pat = re.compile(b'oldtext')                # bytes pattern, because the file is read as bytes

count = 0
for match in all_files('*.html', '.'):
    with open(match,'rb+') as thefile:
        diag = False                        # becomes True once something is overwritten
        fno = thefile.fileno()
        line = thefile.readline()
        while line:
            if b'oldtext' in line:
                diag = True
                thefile.seek(-len(line),1)  # go back to the start of the line just read
                thefile.write(pat.sub(b'newtext',line))   # rewrite it (same length)
                thefile.flush()
                fsync(fno)
            line = thefile.readline()
    if diag:
        count += 1

.

The calls to thefile.flush() and fsync(fno) after each write ensure that the data passed to write() is actually written out, in order, before the next read or seek on the same file object, so that the position of thefile stays accurate when reads and writes alternate.

flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior. http://docs.python.org/library/stdtypes.html#file.flush
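
The flush-then-fsync pair used in the snippets above can be isolated in a small helper. This is only a sketch; write_through is a hypothetical name.

```python
import os

def write_through(fileobj, data):
    """Write `data`, flush Python's buffer, then ask the OS to commit it to disk."""
    fileobj.write(data)
    fileobj.flush()               # Python's buffer -> operating system
    os.fsync(fileobj.fileno())    # operating system cache -> disk
```

Calling os.fsync() after every small write can be slow, so whether it is worth doing for each occurrence depends on how much durability matters for the files in question.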

.

These programs do the minimum amount of writing, so I think they are fast.

.

Nota bene: opening a file in mode 'rb+' does not change its last-modification time if nothing is written to it.
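
This can be checked with a throwaway file. The sketch below makes some assumptions for the demonstration: a temporary file and an arbitrary old timestamp set with os.utime().

```python
import os
import tempfile

# Create a throwaway file and give it an old modification time.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(b"nothing to replace here")
old_time = 1000000000  # arbitrary moment in the past, in seconds since the epoch
os.utime(path, (old_time, old_time))

# Open for update, read the content, but never write.
with open(path, "rb+") as f:
    f.read()

mtime_unchanged = int(os.stat(path).st_mtime) == old_time
print(mtime_unchanged)

os.remove(path)
```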

eyquem answered Oct 06 '22 01:10