Finding and substituting a list of words in a file using regex in Python

Tags: python, regex, python-2.7

I want to print the contents of a file to the terminal and in the process highlight any words that are found in a list without modifying the original file. Here's an example of the not-yet-working code:

    def highlight_story(self):
        """Print a line from a file and highlight words in a list."""

        the_file = open(self.filename, 'r')
        file_contents = the_file.read()

        for word in highlight_terms:
            regex = re.compile(
                  r'\b'      # Word boundary.
                + word       # Each item in the list.
                + r's{0,1}', # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            result = re.sub(regex, subst, file_contents)

        print result
        the_file.close()

highlight_terms = [
    'dog',
    'hedgehog',
    'grue'
]

As it is, only the last item in the list is highlighted, regardless of what it is or how long the list is. I assume that each substitution is performed and then "forgotten" when the next iteration begins. It looks something like this (highlighted words are shown in [brackets]):

[Grues] have been known to eat both human and non-human animals. In poorly-lit areas dogs and hedgehogs are considered by any affluent [grue] to be delicacies. Dogs can frighten away a [grue], however, by barking in a musical scale. A hedgehog, on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a [grue] king.

But it should look like this:

[Grues] have been known to eat both human and non-human animals. In poorly-lit areas [dogs] and [hedgehogs] are considered by any affluent [grue] to be delicacies. [Dogs] can frighten away a [grue], however, by barking in a musical scale. A [hedgehog], on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a [grue] king.

How can I stop the other substitutions from being lost?

asked Nov 08 '14 19:11

Christopher Perry

3 Answers

You can modify your regex to the following:

# Note the ? instead of {0,1}; it has the same effect.
regex = re.compile(r'\b(' + '|'.join(highlight_terms) + r')s?', flags=re.IGNORECASE | re.VERBOSE)

Then, you won't need the for loop.

This code joins the words in the list with |, the regex alternation operator, so that a single pattern matches any of them. So if your list was something like:

a = ['cat', 'dog', 'mouse']

The regex would be:

\b(cat|dog|mouse)s?
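As a rough, self-contained sketch of the single-pattern approach: the `highlight` helper and the sample sentence are made up for illustration, and `re.escape()` plus the trailing `\b` are additions not present in the answer above (they guard against metacharacters in the terms and against matching "dog" inside a longer word such as "dogma"). `re.VERBOSE` is left out because the pattern contains no inline regex comments. The code is written to run on Python 3.

```python
import re

highlight_terms = ['dog', 'hedgehog', 'grue']

# Build one alternation: \b(?:dog|hedgehog|grue)s?\b
# re.escape() and the trailing \b are assumptions added for safety.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(term) for term in highlight_terms) + r')s?\b',
    flags=re.IGNORECASE)


def highlight(text):
    """Wrap every match in the ANSI escape codes used in the question."""
    return pattern.sub('\033[1;41m' + r'\g<0>' + '\033[0m', text)


sample = 'Dogs can frighten away a grue, but not a hotdog.'
print(highlight(sample))  # "Dogs" and "grue" are wrapped; "hotdog" is not
```

Because every term is matched in one pass over the text, no substitution can overwrite another.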

answered Oct 19 '22 00:10

sshashank124

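The regex is correct; the problem is in the for loop.

result = re.sub(regex, subst, file_contents)

This line replaces every match of regex in file_contents with subst and stores the outcome in result.

On the second iteration, the substitution is again applied to the original file_contents, whereas you intended to apply it to result. Each pass therefore overwrites the previous one, and only the last term's highlighting survives.

How to correct it: initialize result before the loop and apply each substitution to it.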

result = file_contents

for word in highlight_terms:
    regex = re.compile(
          r'\b'      # Word boundary.
        + word       # Each item in the list.
        + r's?\b',   # One optional 's', then a closing word boundary.
        flags=re.IGNORECASE | re.VERBOSE)
    print regex.pattern  # Debug output: show the pattern being used.
    subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
    result = re.sub(regex, subst, result)  # Change made here.

print result
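To see the difference side by side, here is a minimal sketch that runs the original loop and the corrected loop on the same string. The function names `highlight_buggy` and `highlight_fixed`, the `START`/`END` constants, and the sample sentence are hypothetical; the code is written to run on Python 3.

```python
import re

highlight_terms = ['dog', 'hedgehog', 'grue']
START, END = '\033[1;41m', '\033[0m'  # ANSI codes from the question


def highlight_buggy(text):
    """Original loop: every pass starts again from the untouched text."""
    result = text
    for word in highlight_terms:
        result = re.sub(r'\b' + word + r's?\b', START + r'\g<0>' + END,
                        text, flags=re.IGNORECASE)
    return result


def highlight_fixed(text):
    """Corrected loop: every pass works on the previous pass's output."""
    result = text
    for word in highlight_terms:
        result = re.sub(r'\b' + word + r's?\b', START + r'\g<0>' + END,
                        result, flags=re.IGNORECASE)
    return result


sample = 'A dog met a hedgehog and a grue.'
print(highlight_buggy(sample))  # only "grue" is highlighted
print(highlight_fixed(sample))  # all three terms are highlighted
```

The only difference between the two functions is the third argument to `re.sub`.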

answered Oct 18 '22 23:10

nu11p01n73R

You need to reassign file_contents to the substituted string each time through the loop. Reassigning file_contents does not change the contents of the file itself:

def highlight_story(self):
    """Print a line from a file and highlight words in a list."""

    the_file = open(self.filename, 'r')
    file_contents = the_file.read()
    for word in highlight_terms:
        regex = re.compile(
              r'\b'      # Word boundary.
            + word       # Each item in the list.
            + r's{0,1}', # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        file_contents = re.sub(regex, subst, file_contents)  # Reassign to the updated value.
    print file_contents
    the_file.close()

Also, using with to open files is a better way to go, since the file is closed automatically. You can also keep the original string, start a second variable from it outside the loop, and update that one inside the loop:

def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    with open(self.filename) as the_file:  # Closed automatically on exit.
        file_contents = the_file.read()
    output = file_contents  # Start from the original text.
    for word in highlight_terms:
        regex = re.compile(
            r'\b'  # Word boundary.
            + word  # Each item in the list.
            + r's{0,1}',  # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        output = re.sub(regex, subst, output)  # Update the working string.
    print output
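The following sketch illustrates the point that the file on disk is never modified. It assumes a standalone function `highlight_file` (a hypothetical name, taking the filename and terms as arguments instead of using `self`) and a throwaway temporary file with made-up sample text; `re.escape()` is an addition. It is written to run on Python 3.

```python
import os
import re
import tempfile


def highlight_file(filename, terms):
    """Return the file's text with each term wrapped in ANSI highlight codes."""
    with open(filename) as the_file:  # closed automatically on exit
        output = the_file.read()
    for word in terms:
        regex = re.compile(r'\b' + re.escape(word) + r's?', flags=re.IGNORECASE)
        output = regex.sub('\033[1;41m' + r'\g<0>' + '\033[0m', output)
    return output


# Throwaway file for demonstration purposes only.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Dogs fear a grue.\n')
    path = tmp.name

highlighted = highlight_file(path, ['dog', 'hedgehog', 'grue'])
print(highlighted)

with open(path) as check:
    unchanged = check.read()  # still the plain text, without escape codes
os.remove(path)
```

Only the in-memory string is rewritten; the file is opened for reading and is left as it was.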

answered Oct 19 '22 00:10

Padraic Cunningham
