
Finding and substituting a list of words in a file using regex in Python

I want to print the contents of a file to the terminal and in the process highlight any words that are found in a list without modifying the original file. Here's an example of the not-yet-working code:

    def highlight_story(self):
        """Print a line from a file and highlight words in a list."""

        the_file = open(self.filename, 'r')
        file_contents = the_file.read()

        for word in highlight_terms:
            regex = re.compile(
                  r'\b'      # Word boundary.
                + word       # Each item in the list.
                + r's{0,1}', # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            result = re.sub(regex, subst, file_contents)

        print result
        the_file.close()

highlight_terms = [
    'dog',
    'hedgehog',
    'grue'
]

As it is, only the last item in the list, regardless of what it is or how long the list is, will be highlighted. I assume that each substitution is performed and then "forgotten" when the next iteration begins. It looks something like this:

Grues have been known to eat both human and non-human animals. In poorly-lit areas dogs and hedgehogs are considered by any affluent grue to be delicacies. Dogs can frighten away a grue, however, by barking in a musical scale. A hedgehog, on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a grue king.

But it should look like this:

Grues have been known to eat both human and non-human animals. In poorly-lit areas dogs and hedgehogs are considered by any affluent grue to be delicacies. Dogs can frighten away a grue, however, by barking in a musical scale. A hedgehog, on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a grue king.

How can I stop the other substitutions from being lost?

Christopher Perry asked Nov 08 '14 19:11


3 Answers

You can modify your regex to the following:

regex = re.compile(r'\b(' + '|'.join(highlight_terms) + r')s?', flags=re.IGNORECASE | re.VERBOSE)  # note the ? instead of {0,1}; it has the same effect

Then, you won't need the for loop.

This code takes the list of words and then concatenates them together with a |. So if your list was something like:

a = ['cat', 'dog', 'mouse']

The regex would be:

\b(cat|dog|mouse)s?
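
One caveat when joining the raw terms: any regex metacharacter inside a term is treated as pattern syntax, and with re.VERBOSE a space inside a term is ignored. The following is a minimal sketch of the same single-pass idea that escapes each term first. The helper names build_highlight_regex and highlight are made up for illustration, and the trailing \b is an added assumption (it stops "dogsled" from being partly highlighted) that only makes sense when terms start and end with word characters.

```python
import re

def build_highlight_regex(terms):
    """Compile one pattern that matches any term, with an optional trailing 's'."""
    # re.escape makes each term literal. Longest terms go first so that a
    # longer term is tried before a shorter one with the same prefix.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')s?\b', flags=re.IGNORECASE)

def highlight(text, terms, start='\033[1;41m', end='\033[0m'):
    """Return text with every match wrapped in the start/end escape codes."""
    regex = build_highlight_regex(terms)
    # A function replacement sidesteps backslash handling in the template.
    return regex.sub(lambda m: start + m.group(0) + end, text)

sample = "Dogs can frighten away a grue. A hedgehog may become a hotdog."
print(highlight(sample, ['dog', 'hedgehog', 'grue']))
```

Because the text is scanned only once, the inserted escape codes are never themselves searched for terms.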
sshashank124 answered Oct 19 '22 00:10


The regex you provided is correct; the problem is in the for loop.

result = re.sub(regex, subst, file_contents)

This line replaces every match of regex in file_contents with subst and stores the outcome in result.

On the second iteration, the substitution is performed on the original file_contents again, whereas you intended to perform it on result. Each iteration therefore discards the highlighting added by the previous one.

How to correct

result = file_contents

for word in highlight_terms:
    regex = re.compile(
          r'\b'      # Word boundary.
        + word       # Each item in the list.
        + r's?\b',   # One optional 's', then a word boundary.
        flags=re.IGNORECASE | re.VERBOSE)
    print(regex.pattern)  # Debug output: show the pattern being used.
    subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
    result = re.sub(regex, subst, result)  # Change made here.

print(result)
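
The difference between substituting into the original string and substituting into the running result is easy to see on a short string. This is a minimal sketch that uses square brackets in place of the ANSI codes so the output is readable; the names text, terms, lost and kept are illustrative only.

```python
import re

text = "dogs and grues"
terms = ['dog', 'grue']

# Buggy loop: every pass starts again from the untouched `text`,
# so only the last term's substitution survives.
for word in terms:
    lost = re.sub(r'\b' + word + r's?', r'[\g<0>]', text, flags=re.IGNORECASE)

# Fixed loop: each pass works on the output of the previous pass.
kept = text
for word in terms:
    kept = re.sub(r'\b' + word + r's?', r'[\g<0>]', kept, flags=re.IGNORECASE)

print(lost)  # dogs and [grues]
print(kept)  # [dogs] and [grues]
```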
nu11p01n73R answered Oct 18 '22 23:10


You need to reassign file_contents to the substituted string on each pass through the loop. Reassigning the variable does not change the contents of the file on disk:

def highlight_story(self):
    """Print a line from a file and highlight words in a list."""

    the_file = open(self.filename, 'r')
    file_contents = the_file.read()
    for word in highlight_terms:
        regex = re.compile(
              r'\b'      # Word boundary.
            + word       # Each item in the list.
            + r's{0,1}', # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        file_contents = re.sub(regex, subst, file_contents)  # reassign to the updated value
    print(file_contents)
    the_file.close()

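Using with to open the file is also a better approach, because the file is closed automatically when the block ends. If you want to keep file_contents intact, bind a second name to the string outside the loop and update that name inside it: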

def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    with open(self.filename) as the_file:
        file_contents = the_file.read()
        output = file_contents  # second name; file_contents itself stays unchanged
        for word in highlight_terms:
            regex = re.compile(
                r'\b'  # Word boundary.
                + word  # Each item in the list.
                + r's{0,1}',  # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            output = re.sub(regex, subst, output)  # update the running result
        print(output)
    # No explicit close() is needed: the with block has already closed the file.
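
To try the loop-based approach outside the class, it can help to separate reading the file from printing. The sketch below is one possible arrangement and not the asker's actual method: highlight_file is a hypothetical standalone function, a temporary file stands in for self.filename, and re.escape is added on the assumption that the terms are plain words.

```python
import os
import re
import tempfile

def highlight_file(path, terms, start='\033[1;41m', end='\033[0m'):
    """Return the file's text with each term wrapped in start/end. The file is only read."""
    with open(path) as the_file:  # closed automatically when the block ends
        output = the_file.read()
    for word in terms:
        regex = re.compile(r'\b' + re.escape(word) + r's?', flags=re.IGNORECASE)
        # Feed the previous result back in so earlier highlights are kept.
        output = regex.sub(lambda m: start + m.group(0) + end, output)
    return output

# Usage with a throwaway file standing in for self.filename.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("A hedgehog met two dogs and a grue.")
try:
    print(highlight_file(tmp.name, ['dog', 'hedgehog', 'grue']))
finally:
    os.remove(tmp.name)
```

Returning the string instead of printing it inside the function keeps the highlighting logic easy to check on its own. Note that each pass rescans text that already contains escape codes, so a term that happens to match part of a code (a bare "m", for instance) would corrupt earlier highlights.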
Padraic Cunningham answered Oct 19 '22 00:10