
Python 2.7 - find and replace from text file, using dictionary, to new text file

I am a newbie to programming and have been studying Python in my spare time for the past few months. I decided to try to create a little script that converts American spellings to English spellings in a text file.

I have been trying all sorts of things for the past 5 hours and eventually came up with something that got me somewhat closer to my goal, but not quite there!

#imported dictionary contains 1800 english:american spelling key:value pairs. 
from english_american_dictionary import dict


def replace_all(text, dict):
    for english, american in dict.iteritems():
        text = text.replace(american, english)
    return text


my_text = open('test_file.txt', 'r')

for line in my_text:
    new_line = replace_all(line, dict)
    output = open('output_test_file.txt', 'a')
    print >> output, new_line

output.close()

I am sure there is a considerably better way to go about things, but for this script, here are the issues I am having:

  • In the output file, the lines are written on every other line, with a blank line between them; the original test_file.txt does not have this. (The contents of test_file.txt are shown at the bottom of this post.)
  • Only the first instance of an American spelling in a line gets converted to English.
  • I didn't really want to open output file in append mode, but couldn't figure out 'r' in this code structure.

Any help appreciated for this eager newb!

The contents of the test_file.txt are:

I am sample file.
I contain an english spelling: colour.
3 american spellings on 1 line: color, analyze, utilize.
1 american spelling on 1 line: familiarize.
Darren Haynes asked Sep 17 '13 03:09


1 Answer

The extra blank line you are seeing is because you are using print to write out a line that already includes a newline character at the end. Since print writes its own newline too, your output becomes double spaced. An easy fix is to use output.write(new_line) instead.
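
A minimal sketch of that difference, assuming an in-memory io.StringIO buffer as a stand-in for the output file. It is written with the print() function so that it runs on Python 3; the Python 2.7 statement print >> output, new_line adds the trailing newline in the same way.

```python
import io

# A line as yielded by "for line in my_text": the newline is still attached.
line = 'I contain an english spelling: colour.\n'

printed = io.StringIO()
print(line, file=printed)  # print appends its own '\n' after the one already in line

written = io.StringIO()
written.write(line)        # write emits the string exactly as given

print(repr(printed.getvalue()))  # ends with two newlines, hence the blank line
print(repr(written.getvalue()))  # ends with a single newline
```

Stripping the newline before printing (for example with line.rstrip('\n')) would also work, but write is the more direct fix.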

As for the file modes, the issue is that you're opening the output file over and over, once per line. You should just open it once, at the start, which also lets you use 'w' mode instead of append mode. It's usually a good idea to use with statements to handle opening files, since they'll take care of closing them for you when you're done with them.
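
Here is a rough sketch of the line-by-line loop restructured that way. The two-entry spellings dictionary and the convert_file helper are hypothetical stand-ins (the real dictionary has about 1800 entries), and the demo runs on scratch files in a temporary directory so that it is self-contained.

```python
import os
import shutil
import tempfile

# Hypothetical two-entry stand-in for the imported dictionary (english: american).
spellings = {'colour': 'color', 'analyse': 'analyze'}

def replace_all(text, spellings):
    # Same logic as in the question, with .items() so it runs on Python 2.7 and 3.
    for english, american in spellings.items():
        text = text.replace(american, english)
    return text

def convert_file(in_path, out_path, spellings):
    # Both files are opened exactly once. 'w' truncates any previous output,
    # so append mode is not needed, and both files are closed automatically
    # when the with block exits.
    with open(in_path, 'r') as in_file, open(out_path, 'w') as out_file:
        for line in in_file:
            out_file.write(replace_all(line, spellings))

# Demo on scratch files.
work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, 'test_file.txt')
out_path = os.path.join(work_dir, 'output_test_file.txt')
with open(in_path, 'w') as f:
    f.write('I am sample file.\ncolor, analyze\n')

convert_file(in_path, out_path, spellings)
with open(out_path) as f:
    result = f.read()
shutil.rmtree(work_dir)

print(result)
```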

I don't understand your other issue, with only some of the replacements happening. Is your dictionary missing the spellings for 'analyze' and 'utilize'?

One suggestion I'd make is to not do your replacements line by line. You can read the whole file in at once with file.read() and then work on it as a single unit. This will probably be faster, since it won't need to loop as often over the items in your spelling dictionary (just once, rather than once per line):

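# spelling_dict stands for the dictionary imported in the question, under a
# name that does not shadow the built-in dict; replace_all is defined there too.
with open('test_file.txt', 'r') as in_file:
    text = in_file.read()

# 'w' creates the output file, or truncates it if it already exists.
with open('output_test_file.txt', 'w') as out_file:
    out_file.write(replace_all(text, spelling_dict))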

Edit:

To make your code correctly handle words that contain other words (like "entire" containing "tire"), you probably need to abandon the simple str.replace approach in favor of regular expressions.
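
A quick illustration of the problem, using a made-up sample sentence: str.replace rewrites "tire" wherever it occurs, while a pattern anchored with \b only touches the standalone word.

```python
import re

text = 'The entire tire was retired.'

# Plain substring replacement also rewrites 'tire' inside longer words.
naive = text.replace('tire', 'tyre')

# \b matches at word boundaries, so only the standalone word is changed.
whole_word = re.sub(r'\btire\b', 'tyre', text)

print(naive)       # The entyre tyre was retyred.
print(whole_word)  # The entire tyre was retired.
```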

Here's a quickly thrown together solution that uses re.sub, given a dictionary of spelling changes from American to British English (that is, in the reverse order of your current dictionary):

import re

#from english_american_dictionary import ame_to_bre_spellings
ame_to_bre_spellings = {'tire':'tyre', 'color':'colour', 'utilize':'utilise'}

def replacer_factory(spelling_dict):
    def replacer(match):
        word = match.group()
        return spelling_dict.get(word, word)
    return replacer

def ame_to_bre(text):
    pattern = r'\b\w+\b'  # this pattern matches whole words only
    replacer = replacer_factory(ame_to_bre_spellings)
    return re.sub(pattern, replacer, text)

def main():
    #with open('test_file.txt') as in_file:
    #    text = in_file.read()
    text = 'foo color, entire, utilize'

    #with open('output_test_file.txt', 'w') as out_file:
    #    out_file.write(ame_to_bre(text))
    print(ame_to_bre(text))

if __name__ == '__main__':
    main()

One nice thing about this code structure is that you can easily convert from British English spellings back to American English ones, if you pass a dictionary in the other order to the replacer_factory function.
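
As a sketch of that idea, the reverse dictionary can be built by swapping keys and values. The bre_to_ame_spellings name and the convert helper are hypothetical additions, and the swap assumes that no two American spellings map to the same British one. The same swap would turn the english:american dictionary from the question into the American-to-British order used above.

```python
import re

ame_to_bre_spellings = {'tire': 'tyre', 'color': 'colour', 'utilize': 'utilise'}

# Swap keys and values to get the British -> American mapping.
# Assumes the values are unique; duplicates would silently overwrite each other.
bre_to_ame_spellings = {bre: ame for ame, bre in ame_to_bre_spellings.items()}

def replacer_factory(spelling_dict):
    def replacer(match):
        word = match.group()
        # Fall back to the original word when it has no entry.
        return spelling_dict.get(word, word)
    return replacer

def convert(text, spelling_dict):
    # Hypothetical helper: like ame_to_bre, but the dictionary is a parameter.
    return re.sub(r'\b\w+\b', replacer_factory(spelling_dict), text)

british = convert('foo color, entire, utilize', ame_to_bre_spellings)
american = convert(british, bre_to_ame_spellings)

print(british)   # foo colour, entire, utilise
print(american)  # foo color, entire, utilize
```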

Blckknght answered Sep 28 '22 08:09