I want to print the contents of a file to the terminal and in the process highlight any words that are found in a list without modifying the original file. Here's an example of the not-yet-working code:
def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    the_file = open(self.filename, 'r')
    file_contents = the_file.read()
    for word in highlight_terms:
        regex = re.compile(
            r'\b'         # Word boundary.
            + word        # Each item in the list.
            + r's{0,1}',  # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        result = re.sub(regex, subst, file_contents)
    print result
    the_file.close()

highlight_terms = [
    'dog',
    'hedgehog',
    'grue'
]
As it is, only the last item in the list, regardless of what it is or how long the list is, will be highlighted. I assume that each substitution is performed and then "forgotten" when the next iteration begins. It looks something like this:
Grues have been known to eat both human and non-human animals. In poorly-lit areas dogs and hedgehogs are considered by any affluent grue to be delicacies. Dogs can frighten away a grue, however, by barking in a musical scale. A hedgehog, on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a grue king.
But it should look like this:
Grues have been known to eat both human and non-human animals. In poorly-lit areas dogs and hedgehogs are considered by any affluent grue to be delicacies. Dogs can frighten away a grue, however, by barking in a musical scale. A hedgehog, on the other hand, must simply resign itself to its fate of becoming a hotdog fit for a grue king.
How can I stop the other substitutions from being lost?
re.sub() replaces every occurrence of the pattern in the target string and returns the result as a new string; the string you pass in is not modified.
You can modify your regex to the following:
regex = re.compile(r'\b(' + '|'.join(highlight_terms) + r')s?', flags=re.IGNORECASE | re.VERBOSE)  # note the ? instead of {0,1}; it has the same effect
Then you won't need the for loop.
This code takes the list of words and concatenates them together with a |. So if your list was something like:
a = ['cat', 'dog', 'mouse']
The regex would be:
\b(cat|dog|mouse)s?
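As a rough sketch of the single-pattern approach (written for Python 3, so print() is a function): the helper names build_pattern and highlight and the sample sentence are made up for illustration. It also adds two things that are not in the snippet above, both assumptions on my part: re.escape() so that a term containing regex metacharacters is matched literally, and sorting the terms longest-first so that a short term cannot win over a longer one that starts the same way. re.VERBOSE is dropped because the pattern contains no whitespace or inline comments.

```python
import re

# ANSI escape codes: bold text on a red background, then reset.
START = '\033[1;41m'
END = '\033[0m'


def build_pattern(terms):
    """Compile one pattern matching any term, with an optional trailing 's'."""
    # Longest terms first, so a shorter term cannot shadow a longer one
    # that begins with the same letters.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = '|'.join(re.escape(term) for term in ordered)
    return re.compile(r'\b(?:' + alternation + r')s?', flags=re.IGNORECASE)


def highlight(text, terms):
    """Return a new string with every matching term wrapped in colour codes."""
    return build_pattern(terms).sub(START + r'\g<0>' + END, text)


sample = 'Dogs and hedgehogs fear the grue, but a hotdog does not.'
highlighted = highlight(sample, ['dog', 'hedgehog', 'grue'])
print(highlighted)
```

Because the text is scanned only once, the escape codes inserted for one term are never rescanned when looking for another term.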
The regex provided is correct; the for loop is where it goes wrong.
result = re.sub(regex, subst, file_contents)
This line substitutes regex with subst in file_contents.
In the second iteration, it again performs the substitution on file_contents, whereas you intended to perform it on result.
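A minimal demonstration of the difference, written for Python 3. The sample sentence and the names lost and kept are invented for this example, and square brackets stand in for the colour codes so the effect is visible in plain text:

```python
import re

text = 'A dog met a grue.'
terms = ['dog', 'grue']

# Buggy: every pass starts from the untouched original, so only the
# substitution made for the last term survives.
for term in terms:
    lost = re.sub(r'\b' + term, '[' + r'\g<0>' + ']', text)

# Fixed: each pass starts from the result of the previous pass.
kept = text
for term in terms:
    kept = re.sub(r'\b' + term, '[' + r'\g<0>' + ']', kept)

print(lost)  # A dog met a [grue].
print(kept)  # A [dog] met a [grue].
print(text)  # A dog met a grue.
```

The last line shows that the original string is left as it was, since re.sub() always returns a new string.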
How to correct
result = file_contents
for word in highlight_terms:
    regex = re.compile(
        r'\b'       # Word boundary.
        + word      # Each item in the list.
        + r's?\b',  # One optional 's' at the end.
        flags=re.IGNORECASE | re.VERBOSE)
    print regex.pattern
    subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
    result = re.sub(regex, subst, result)  # change made here
print result
You need to reassign file_contents to the replaced string each time through the loop. Reassigning file_contents does not change the contents of the file:
def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    the_file = open(self.filename, 'r')
    file_contents = the_file.read()
    for word in highlight_terms:
        regex = re.compile(
            r'\b'         # Word boundary.
            + word        # Each item in the list.
            + r's{0,1}',  # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        file_contents = re.sub(regex, subst, file_contents)  # reassign to updated value
    print file_contents
    the_file.close()
Also, using with to open files is a better way to go, and you can bind a second name to the string outside the loop and update that one inside:
def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    with open(self.filename) as the_file:  # closed automatically on exit
        file_contents = the_file.read()
    output = file_contents  # second name; file_contents stays untouched
    for word in highlight_terms:
        regex = re.compile(
            r'\b'         # Word boundary.
            + word        # Each item in the list.
            + r's{0,1}',  # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        output = re.sub(regex, subst, output)  # update the working string
    print output
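For anyone on Python 3, here is a standalone sketch of the same idea as a plain function that returns the highlighted text instead of printing it. The name highlight_file and its parameters are hypothetical, and the re.escape() call is my addition to keep terms with special characters from being read as regex syntax:

```python
import re


def highlight_file(path, terms):
    """Return the file's text with each term highlighted; the file is only read."""
    with open(path) as the_file:  # read-only by default, closed automatically
        output = the_file.read()
    for term in terms:
        pattern = re.compile(r'\b' + re.escape(term) + r's?', flags=re.IGNORECASE)
        # Feed each result into the next pass so earlier highlights are kept.
        output = pattern.sub('\033[1;41m' + r'\g<0>' + '\033[0m', output)
    return output
```

It could be called as print(highlight_file('story.txt', highlight_terms)), where story.txt is a placeholder filename. The file on disk is never written to, so the highlighting exists only in the returned string.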