
How to search for regex pattern across multiple lines of text with re.DOTALL?

I'm a lawyer and python beginner, so I'm both (a) dumb and (b) completely out of my lane.

I'm trying to apply a regex pattern to a text file. The pattern can sometimes stretch across multiple lines. I'm specifically interested in these lines from the text file:

Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, 
Judge;  and \n
 \n
Dickinson, Emily, Judge.

I'd like to individually hunt for, extract, and then print the judges' names. My code so far looks like this:

import re
def judges():
    presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
    judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
    judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
    with open("text.txt", "r") as case:
        for lines in case:
            presiding_match = re.search(presiding, lines)
            judge2_match = re.search(judge2, lines)
            judge3_match = re.search(judge3, lines)
            if presiding_match or judge2_match or judge3_match:
                print(presiding_match.group(1))
                print(judge2_match.group(1))
                print(judge3_match.group(1))
                break

When I run it, I can get Hemingway and Bell, but then I get an "AttributeError: 'NoneType' object has no attribute 'group'" for the third judge after the two line breaks.

After trial-and-error, I've found that my code is only reading the first line (until the "Bell, Judge; and") then quits. I thought the re.DOTALL would solve it, but I can't seem to make it work.

I've tried a million ways to capture the line breaks and get the whole thing, including re.match, re.DOTALL, re.MULTILINE, "".join, "".join(lines.strip()), and anything else I can throw against the wall to make stick.

After a couple days, I've bowed to asking for help. Thanks for anything you can do.

(As an aside, I've had no luck getting the regex to work with the ^ and $ characters. It also seems to hate the \. escape in the judge3 regex.)

chekhov's_gin asked Dec 29 '18 19:12


2 Answers

You are passing in single lines, because you are iterating over the open file referenced by case. The regex is never passed anything other than a single line of text. Your regexes can each match some of the lines, but they don't all together match the same single line.

You'd have to read in more than one line. If the file is small enough, just read it as one string:

with open("text.txt", "r") as case:
    case_text = case.read()

then apply your regular expressions to that one string.
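
A minimal sketch of that whole-file approach, assuming the tightened patterns suggested further down in this answer rather than the original .* ones. The helper name find_judges and the temporary file standing in for text.txt are illustrative assumptions, not part of the original code:

```python
import re
import tempfile
from pathlib import Path

# Tightened patterns: [^;]+ keeps a name from running past a semicolon.
PATTERNS = {
    "presiding": re.compile(r'by\s*?([A-Z][^;]+),\s+?Presiding\s+?Judge;'),
    "judge2": re.compile(r'Presiding\s+?Judge;\s+?([A-Z][^;]+),\s+?Judge;'),
    "judge3": re.compile(r'([A-Z][^;]+), Judge\.'),
}


def find_judges(path):
    """Read the whole file as one string and search it with each pattern."""
    text = Path(path).read_text()
    found = {}
    for role, pattern in PATTERNS.items():
        match = pattern.search(text)
        # search() returns None when nothing matches, so guard before .group()
        found[role] = match.group(1) if match else None
    return found


# Demo: a temporary file stands in for text.txt (sample text from the question).
sample = (
    "Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and \n"
    "\n"
    "Dickinson, Emily, Judge.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "text.txt"
    sample_path.write_text(sample)
    result = find_judges(sample_path)
    print(result)
```

Storing None for a pattern that did not match avoids the AttributeError from the question, because .group() is only ever called on a real match object.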

Or, you could test each of the match objects individually, not as a group, and only print those that matched:

if presiding_match:
    print(presiding_match.group(1))
elif judge2_match:
    print(judge2_match.group(1))
elif judge3_match:
    print(judge3_match.group(1))

but then you'll have to create additional logic to determine when you are done reading from the file and break out of the loop.
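
One way that additional logic might look, as a rough sketch: remember which judges have already been found and stop once all three are in. It reuses the question's patterns without DOTALL, since each line is searched on its own. The function name judges_by_line and the io.StringIO object standing in for the open file are assumptions made for illustration:

```python
import io
import re

# The question's patterns, without DOTALL because each line is searched alone.
patterns = {
    "presiding": re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;'),
    "judge2": re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;'),
    "judge3": re.compile(r'([A-Z].*), Judge\.'),
}


def judges_by_line(lines):
    """Collect the first match of each pattern, stopping once all are found."""
    found = {}
    for line in lines:
        for role, pattern in patterns.items():
            if role in found:
                continue  # this judge was already found on an earlier line
            match = pattern.search(line)
            if match:  # only call .group() on a real match object
                found[role] = match.group(1)
        if len(found) == len(patterns):
            break  # the "done" check: every judge has been seen
    return found


# io.StringIO stands in for the open file; iterating it yields lines.
case = io.StringIO(
    "Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and \n"
    "\n"
    "Dickinson, Emily, Judge.\n"
)
result = judges_by_line(case)
print(result)
```

This only works when each name and its title sit on the same line. If a line break can fall between them, reading the whole file as one string is the more robust option.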

Note that the patterns you are matching are not broken across lines, so the DOTALL flag is not actually needed here. You do match .* text, so you are running the risk of matching too much if you use DOTALL:

>>> import re
>>> case_text = """Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and
...
... Dickinson, Emily, Judge.
... """
>>> presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
>>> judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
>>> judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and \n\nDickinson, Emily',)

I'd at least replace [A-Z].* with [A-Z][^;]+, so that a name cannot run past a ; semicolon and must be at least 2 characters long. Once the . wildcard is gone, DOTALL has nothing left to affect, so just drop the flags altogether:

>>> presiding = re.compile(r'by\s*?([A-Z][^;]+),\s+?Presiding\s+?Judge;')
>>> judge2 = re.compile(r'Presiding\s+?Judge;\s+?([A-Z][^;]+),\s+?Judge;')
>>> judge3 = re.compile(r'([A-Z][^;]+), Judge\.')
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Dickinson, Emily',)

You can combine the three patterns into one:

judges = re.compile(
    r'(?:Considered\s+?and\s+?decided\s+?by\s+?)?'
    r'([A-Z][^;]+),\s+?(?:Presiding\s+?)?Judge[.;]'
)

which can find all the judges in your input in one go with .findall():

>>> judges.findall(case_text)
['Hemingway', 'Bell', 'Dickinson, Emily']
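
If you also need to know which judge was presiding, one possible variation (an illustrative sketch, not something from the original answer) is to turn the two groups into named groups and use .finditer(). The group names name and presiding and the variable panel are assumptions:

```python
import re

# Same combined pattern, but the optional "Presiding" part is now a named
# capturing group, so each match records whether the title was present.
judges = re.compile(
    r'(?:Considered\s+?and\s+?decided\s+?by\s+?)?'
    r'(?P<name>[A-Z][^;]+),\s+?(?P<presiding>Presiding\s+?)?Judge[.;]'
)

case_text = (
    "Considered  and  decided  by  Hemingway,  Presiding  Judge;  Bell, Judge;  and \n"
    "\n"
    "Dickinson, Emily, Judge.\n"
)

# A group that did not take part in the match comes back as None.
panel = [
    (m.group("name"), m.group("presiding") is not None)
    for m in judges.finditer(case_text)
]
print(panel)
```

Each entry pairs a name with a flag that is True only for the presiding judge.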
Martijn Pieters answered Oct 21 '22 17:10


Assuming you can read the file all at once (i.e. the file is not too big), you can extract the judge information as follows:

import re

regex = re.compile(
    r'decided\s+by\s+(?P<presiding_judge>[A-Za-z]+)\s*,\s+Presiding\s+Judge;'
    r'\s+(?P<judge>[A-Za-z]+)\s*,\s+Judge;'
    r'\s+and\s+(?P<extra_judges>[A-Za-z,\s]+)\s*,\s+Judge\.?',
    re.DOTALL | re.MULTILINE
)

filename = 'text.txt'
with open(filename) as fd:
    data = fd.read()

for match in regex.finditer(data):
    print(match.groupdict())

With a sample input text file (text.txt), the output becomes:

{'judge': 'Bell', 'extra_judges': 'Dickinson, Emily', 'presiding_judge': 'Hemingway'}
{'judge': 'Abel', 'extra_judges': 'Lagrange, Gauss', 'presiding_judge': 'Einstein'}
{'judge': 'Dirichlet', 'extra_judges': 'Fourier, Cauchy', 'presiding_judge': 'Newton'}

You can also experiment with this pattern on the regex101 site.

dopstar answered Oct 21 '22 19:10