
In Python, how do I search for only one word before my selected substring?

Given a long list of lines in a text file, I only want to return the word that immediately precedes, for example, the word dog (the word that describes dog). For example, say these lines contain dog:

“hotdog” “big dog” “is dogged” “dog spy” “with my dog” “brown dogs”

In this case the desired result is only “big” “my” “brown”

I have used this Python script:

import re
with open('titles_500subset.txt') as searchfile:
    for line in searchfile:
        line = line.lower()
        # Lazily capture everything before the first " dog" on the line
        d = re.search('(.+?) dog', line)
        if d:
            found = d.group(1)
            print(found)

This would return “with my” and “big”

So I don't get “brown”, and I get the whole phrase “with my” rather than just “my”.

How do I specify just one word before dog (obviously I cannot put a space before (.+?) as then I would exclude “big” and “brown” as they are at the start of a line)?

How can I specify just one character to come after dog e.g. “s” to get only words before dogs and dog but not dogged?

And in a perfect case I’d also like to be able to specify results to exclude, e.g. “my”.

Many thanks

Betty asked Dec 19 '22 04:12



2 Answers

Just split each line into a list of words on whitespace; then you can find dog in the list and print the element before it.

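with open('titles_500subset.txt') as searchfile:
    for line in searchfile:
        words = line.lower().split()
        # Only look from the second word on, so 'dog' always has a predecessor
        if 'dog' in words[1:]:
            # Search from index 1 too, so a leading 'dog' is not picked up
            print(words[words.index('dog', 1) - 1])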

That requires a bit more work if you want it to detect multiple dogs per line, but it's a simpler setup for grabbing certain words if spaces are all that's important to you.

Also, the way I've done this turns the entire document lowercase, so you'd need to add extra checks if you don't want it to work that way.

I changed the if condition to look for 'dog' only from the second word onwards (words[1:]), so it checks that dog exists and that it isn't the first word of the line in one go. (If dog were found at index zero, the preceding word would be looked up at index -1, which returns the last word of that line, which is undesired behaviour.)
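
For the multiple-dogs-per-line case mentioned above, here is a minimal sketch. The helper name words_before and the sample line are made up for illustration, and it assumes words are separated by whitespace, as in the code above.

```python
def words_before(line, target="dog"):
    """Return every word that directly precedes `target` in `line`."""
    words = line.lower().split()
    # Start at index 1 so a target at the very start of the line is skipped
    # (it has no preceding word).
    return [words[i - 1] for i in range(1, len(words)) if words[i] == target]


print(words_before("Big dog chases my dog"))  # ['big', 'my']
```

Because it walks over every position instead of calling index() once, each occurrence on the line contributes its own preceding word.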

If you want to check multiple keywords:

keywords = ["dog", "dogs"]
with open('titles_500subset.txt') as searchfile:
    for line in searchfile:
        words = line.lower().split()
        for key in keywords:
            # Skip the first word, since it has nothing before it
            if key in words[1:]:
                print(words[words.index(key, 1) - 1])

Just add any words you might want to search into the keywords list.
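
The question also asks about excluding certain results, such as “my”. One way the keyword loop could be extended is sketched below; the function name describing_words, the exclude parameter, and the in-memory sample list (taken from the question's examples instead of the file) are assumptions for illustration only.

```python
def describing_words(lines, keywords=("dog", "dogs"), exclude=("my",)):
    """Collect the word before each keyword, skipping excluded words."""
    keywords = set(keywords)
    exclude = set(exclude)
    found = []
    for line in lines:
        words = line.lower().split()
        # Pair every word with the word that follows it.
        for prev, word in zip(words, words[1:]):
            if word in keywords and prev not in exclude:
                found.append(prev)
    return found


sample = ["hotdog", "big dog", "is dogged", "dog spy", "with my dog", "brown dogs"]
print(describing_words(sample))  # ['big', 'brown']
```

To run it on the file, an open file object can be passed as lines, since iterating over a file yields its lines.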

SuperBiasedMan answered Dec 21 '22 23:12



You can run a regex on the whole text instead of running it on each line. Try this:

import re
with open('titles_500subset.txt') as searchfile:
    text = searchfile.read()
    # Each result is a tuple: (word before dog, character that followed)
    d = re.findall(r'([^ \r\n]+) dogs?([\r\n]| |$)', text, re.IGNORECASE)
    for result in d:
        print(result[0])

RegEx explanation:

  • ([^ \r\n]+) one or more characters that are not a space or a newline character (this is the captured word)
  • " " followed by a space character
  • dog followed by "dog"
  • s? followed by an optional "s"
  • ([\r\n]| |$) followed by one of 3 alternatives: a newline character, a space, or the end of the file
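
As a variation on the pattern explained above (not the same regex), word boundaries can do the job of the trailing alternatives. This is only a sketch: the sample text is built from the question's examples rather than read from the file, and it assumes that "word" means a run of \w characters.

```python
import re

# Sample text built from the question's examples (one title per line).
text = "hotdog\nbig dog\nis dogged\ndog spy\nwith my dog\nbrown dogs"

# \b(\w+)   a whole word, captured
# [ \t]+    spaces or tabs only, so a match never spans two lines
# dogs?\b   "dog" or "dogs" as a complete word (not "dogged")
pattern = re.compile(r"\b(\w+)[ \t]+dogs?\b", re.IGNORECASE)

matches = pattern.findall(text)
print(matches)  # ['big', 'my', 'brown']
```

With a single capture group, findall returns plain strings, so there is no need to index into each result.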
Raphael Müller answered Dec 22 '22 01:12
