Given a long list of lines in a text file, I only want to return the substring that immediately precedes a particular word, for example the word dog (that is, the word that describes dog). Say, for example, there are these lines containing dog:
“hotdog” “big dog” “is dogged” “dog spy” “with my dog” “brown dogs”
In this case the desired result is only “big” “my” “brown”
I have used this Python script:
    import re
    with open('titles_500subset.txt') as searchfile:
        for line in searchfile:
            line = line.lower()
            d = re.search('(.+?) eye', line)
            if d:
                found = d.group(1)
                print(found)
This would return “with my” and “big”. So here I wouldn't get “brown”, and I get all the terms in “with my”.
How do I specify just one word before dog? (Obviously I cannot put a space before (.+?), as then I would exclude “big” and “brown”, since they are at the start of a line.)
How can I specify just one character to come after dog, e.g. “s”, to get only words before dogs and dog but not dogged?
And in a perfect case I’d also like to be able to specify results to exclude, e.g. “my”.
Many thanks
Just split each line into a list of words on whitespace; then you can find dog in the list and print the element before it.
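    with open('titles_500subset.txt') as searchfile:
        for line in searchfile:
            words = line.lower().split()
            # Look for 'dog' from index 1 onwards so there is always a preceding word
            if 'dog' in words[1:]:
                # Start the index search at 1 too, so a 'dog' at index 0 is skipped
                print(words[words.index('dog', 1) - 1])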
That requires a bit more work if you want it to detect multiple dogs per line, but it's a simpler setup for grabbing certain words if spaces are all that matter to you.
Also, the way I've done this lowercases every line, so you'd need to add extra checks if you don't want it to work that way.
I changed the if condition to look for 'dog' only from index 1 onwards (words[1:]), so it checks that dog exists and that it is not the first word of the line in one go. (If dog were found at index 0, the preceding word would be looked up at index -1, which returns the last word of that line. That is undesired behaviour.)
If you want to check multiple key words:
    keywords = ["dog", "dogs"]
    with open('titles_500subset.txt') as searchfile:
        for line in searchfile:
            words = line.lower().split()
            for key in keywords:
                # Only accept a keyword that has a word in front of it
                if key in words[1:]:
                    print(words[words.index(key, 1) - 1])
Just add any words you might want to search into the keywords list.
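For the multiple-dogs-per-line case mentioned above, and for the exclusion list the question asks about, here is a minimal sketch of one way to extend the split approach. The function name words_before and its exclude parameter are made up for illustration, and it takes a list of lines rather than a file so it can be tried without titles_500subset.txt:

```python
def words_before(lines, keywords, exclude=()):
    """Return the word preceding every occurrence of any keyword."""
    keywords = set(keywords)
    exclude = set(exclude)
    found = []
    for line in lines:
        words = line.lower().split()
        # Start at index 1: a keyword at index 0 has no preceding word.
        for i in range(1, len(words)):
            if words[i] in keywords and words[i - 1] not in exclude:
                found.append(words[i - 1])
    return found


sample = ["hotdog", "big dog", "is dogged", "dog spy", "with my dog", "brown dogs"]
print(words_before(sample, ["dog", "dogs"]))                  # ['big', 'my', 'brown']
print(words_before(sample, ["dog", "dogs"], exclude=["my"]))  # ['big', 'brown']
```

Walking the list by index instead of calling words.index() is what lets it report every match on a line rather than only the first. To run it on the file, pass the open file object as lines.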
You can run a regex on the whole text instead of running it on each line. Try this:
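    import re
    with open('titles_500subset.txt') as searchfile:
        text = searchfile.read()
    # Capture the run of non-space characters before " dog" or " dogs"
    d = re.findall(r'([^ \r\n]+) dogs?([\r\n]| |$)', text, re.IGNORECASE)
    for result in d:
        # Each result is a (preceding word, trailing delimiter) tuple
        print(result[0])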
RegEx explanation:
([^ \r\n]+) : find something that is not a space or newline character (one or more characters)
dog : followed by "dog"
s? : followed by an optional "s"
([\r\n]| |$) : select from 3 alternatives: a newline character, a space, or the end of the file
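As a variation on the pattern explained above, here is a hedged sketch that uses word boundaries (\b) instead of listing the delimiters, and filters out unwanted results afterwards. This is a swapped-in technique, not the pattern from this answer. The names PATTERN and preceding_words are made up for illustration, and the sample text stands in for the contents of the file:

```python
import re

# \b(\w+)  : one whole word (captured)
# [ \t]+   : spaces or tabs only, so a match cannot span two lines
# dogs?\b  : "dog" or "dogs" ending at a word boundary (so not "dogged")
PATTERN = re.compile(r"\b(\w+)[ \t]+dogs?\b", re.IGNORECASE)


def preceding_words(text, exclude=()):
    """Return the words found before dog/dogs, minus any excluded ones."""
    exclude = {w.lower() for w in exclude}
    return [w for w in PATTERN.findall(text) if w.lower() not in exclude]


text = "hotdog\nbig dog\nis dogged\ndog spy\nwith my dog\nbrown dogs\n"
print(preceding_words(text))                  # ['big', 'my', 'brown']
print(preceding_words(text, exclude=["my"]))  # ['big', 'brown']
```

The trailing \b is what rejects dogged, and requiring whitespace before dog is what rejects hotdog. Note that \w only matches letters, digits and underscores, so a preceding word containing a hyphen or apostrophe would be captured only in part.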