I'm a lawyer and Python beginner, so I'm both (a) dumb and (b) completely out of my lane.
I'm trying to apply a regex pattern to a text file. The pattern can sometimes stretch across multiple lines. I'm specifically interested in these lines from the text file:
Considered and decided by Hemingway, Presiding Judge; Bell,
Judge; and \n
\n
Dickinson, Emily, Judge.
I'd like to individually hunt for, extract, and then print the judges' names. My code so far looks like this:
import re

def judges():
    presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
    judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
    judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
    with open("text.txt", "r") as case:
        for lines in case:
            presiding_match = re.search(presiding, lines)
            judge2_match = re.search(judge2, lines)
            judge3_match = re.search(judge3, lines)
            if presiding_match or judge2_match or judge3_match:
                print(presiding_match.group(1))
                print(judge2_match.group(1))
                print(judge3_match.group(1))
                break
When I run it, I can get Hemingway and Bell, but then I get an "AttributeError: 'NoneType' object has no attribute 'group'" for the third judge after the two line breaks.
After trial-and-error, I've found that my code is only reading the first line (until the "Bell, Judge; and") then quits. I thought the re.DOTALL would solve it, but I can't seem to make it work.
I've tried a million ways to capture the line breaks and get the whole thing, including re.match, re.DOTALL, re.MULTILINE, "".join, "".join(lines.strip()), and anything else I can throw against the wall to make stick.
After a couple days, I've bowed to asking for help. Thanks for anything you can do.
(As an aside, I've had no luck getting the regex to work with the ^ and $ characters. It also seems to hate the \. escape in the judge3 regex.)
You are passing in single lines, because you are iterating over the open file referenced by case. The regex is never passed anything other than a single line of text. Each of your regexes can match one of the lines, but they never all match the same single line.
You'd have to read in more than one line. If the file is small enough, just read it as one string:
with open("text.txt", "r") as case:
    case_text = case.read()
then apply your regular expressions to that one string.
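To make that concrete, here is a minimal sketch of the whole-file approach. It reuses the three patterns from the question, but compiled without re.DOTALL so that . cannot run across line breaks. The function names find_judges and find_judges_in_file are my own, not anything from the question, and each match is checked for None before .group() is called.

```python
import re

# The question's three patterns, compiled WITHOUT re.DOTALL so that '.'
# stops at the end of a line instead of swallowing the rest of the text.
PRESIDING = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;')
JUDGE2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;')
JUDGE3 = re.compile(r'([A-Z].*), Judge\.')


def find_judges(case_text):
    """Return the name captured by each pattern, skipping patterns that miss."""
    names = []
    for pattern in (PRESIDING, JUDGE2, JUDGE3):
        match = pattern.search(case_text)
        if match is not None:  # search() returns None when nothing matches
            names.append(match.group(1))
    return names


def find_judges_in_file(path):
    """Hypothetical helper: read the whole file as one string, then search it."""
    with open(path, "r") as case:
        return find_judges(case.read())
```

Calling find_judges_in_file("text.txt") should then return a list of names instead of raising the AttributeError, because a pattern that finds nothing is simply skipped.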
Or, you could test each of the match objects individually, not as a group, and only print those that matched:
if presiding_match:
    print(presiding_match.group(1))
# separate if statements, because one line can contain more than one judge
if judge2_match:
    print(judge2_match.group(1))
if judge3_match:
    print(judge3_match.group(1))
but then you'll have to create additional logic to determine when you are done reading from the file and break out of the loop.
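As a rough sketch of what that additional logic could look like: the function below (judges_line_by_line is a made-up name) walks over the lines, tests each pattern independently because the first line holds two names, and stops once the third pattern has matched. Treating the "..., Judge." match as the end of the list is my assumption about your files, and this version will not cope with a single name that is itself wrapped across two lines.

```python
import re

presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;')
judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;')
judge3 = re.compile(r'([A-Z].*), Judge\.')


def judges_line_by_line(lines):
    """Collect judge names from an iterable of lines, e.g. an open file."""
    found = []
    for line in lines:
        # Test each pattern on its own: one line may satisfy several of them.
        for pattern in (presiding, judge2):
            match = pattern.search(line)
            if match:
                found.append(match.group(1))
        last = judge3.search(line)
        if last:
            found.append(last.group(1))
            break  # assumed stop condition: "..., Judge." closes the list
    return found
```

With an open file you would call it as judges_line_by_line(case) inside the with block.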
Note that the patterns you are matching are not broken across lines, so the DOTALL flag is not actually needed here. You do match .* text, so you are running the risk of matching too much if you use DOTALL:
>>> import re
>>> case_text = """Considered and decided by Hemingway, Presiding Judge; Bell, Judge; and
...
... Dickinson, Emily, Judge.
... """
>>> presiding = re.compile(r'by\s*?([A-Z].*),\s*?Presiding\s*?Judge;', re.DOTALL)
>>> judge2 = re.compile(r'Presiding\s*?Judge;\s*?([A-Z].*),\s*?Judge;', re.DOTALL)
>>> judge3 = re.compile(r'([A-Z].*), Judge\.', re.DOTALL)
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Considered and decided by Hemingway, Presiding Judge; Bell, Judge; and \n\nDickinson, Emily',)
I'd at least replace [A-Z].* with [A-Z][^;]+ (or [A-Z][^;\n]+ if a name never wraps across lines), so that a name can no longer run past a ; semicolon and only names at least 2 characters long match. Just drop the DOTALL flags altogether:
>>> presiding = re.compile(r'by\s*?([A-Z][^;]+),\s+?Presiding\s+?Judge;')
>>> judge2 = re.compile(r'Presiding\s+?Judge;\s+?([A-Z][^;]+),\s+?Judge;')
>>> judge3 = re.compile(r'([A-Z][^;]+), Judge\.')
>>> presiding.search(case_text).groups()
('Hemingway',)
>>> judge2.search(case_text).groups()
('Bell',)
>>> judge3.search(case_text).groups()
('Dickinson, Emily',)
You can combine the three patterns into one:
judges = re.compile(
    r'(?:Considered\s+?and\s+?decided\s+?by\s+?)?'
    r'([A-Z][^;]+),\s+?(?:Presiding\s+?)?Judge[.;]'
)
which can find all the judges in your input in one go with .findall():
>>> judges.findall(case_text)
['Hemingway', 'Bell', 'Dickinson, Emily']
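One thing to watch for: because [^;]+ is allowed to cross line breaks, a name that wraps onto the next line in the file comes back with the newline still inside it. The sketch below tidies that up by collapsing every run of whitespace to a single space. judge_names is a hypothetical helper name, and I am assuming the text you feed in is just the "Considered and decided by ..." passage; on a full opinion the pattern may pick up other capitalised text that happens to precede ", Judge".

```python
import re

judges = re.compile(
    r'(?:Considered\s+?and\s+?decided\s+?by\s+?)?'
    r'([A-Z][^;]+),\s+?(?:Presiding\s+?)?Judge[.;]'
)


def judge_names(case_text):
    """Find all judge names and normalise whitespace inside each one."""
    # str.split() with no argument splits on any run of whitespace,
    # so rejoining with a single space removes embedded newlines.
    return [' '.join(name.split()) for name in judges.findall(case_text)]
```

On the sample text this returns the same three names as the plain .findall() call, and a name wrapped as "Dickinson," followed by "Emily" on the next line is returned as 'Dickinson, Emily'.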
Assuming you can read the file all at once (i.e. the file is not too big), you can extract the judge information as follows:
import re

regex = re.compile(
    r'decided\s+by\s+(?P<presiding_judge>[A-Za-z]+)\s*,\s+Presiding\s+Judge;'
    r'\s+(?P<judge>[A-Za-z]+)\s*,\s+Judge;'
    r'\s+and\s+(?P<extra_judges>[A-Za-z,\s]+)\s*,\s+Judge\.?',
    re.DOTALL | re.MULTILINE
)

filename = 'text.txt'
with open(filename) as fd:
    data = fd.read()

for match in regex.finditer(data):
    print(match.groupdict())
With a sample input text file (text.txt), the output becomes:
{'judge': 'Bell', 'extra_judges': 'Dickinson, Emily', 'presiding_judge': 'Hemingway'}
{'judge': 'Abel', 'extra_judges': 'Lagrange, Gauss', 'presiding_judge': 'Einstein'}
{'judge': 'Dirichlet', 'extra_judges': 'Fourier, Cauchy', 'presiding_judge': 'Newton'}
You can also experiment with this pattern on the regex101 site.
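If you want to try the pattern without creating a file first, something like the sketch below should work. The sample string is a hypothetical stand-in that I put together from the names in the output above, not the actual test file, and extract_panels is a made-up function name.

```python
import re

regex = re.compile(
    r'decided\s+by\s+(?P<presiding_judge>[A-Za-z]+)\s*,\s+Presiding\s+Judge;'
    r'\s+(?P<judge>[A-Za-z]+)\s*,\s+Judge;'
    r'\s+and\s+(?P<extra_judges>[A-Za-z,\s]+)\s*,\s+Judge\.?',
    re.DOTALL | re.MULTILINE
)


def extract_panels(data):
    """Return one dict of named groups per 'decided by ...' passage."""
    return [match.groupdict() for match in regex.finditer(data)]


# Hypothetical sample data: two passages, the first with blank lines inside.
sample = (
    "Considered and decided by Hemingway, Presiding Judge; Bell, Judge; and \n"
    "\n"
    "Dickinson, Emily, Judge.\n"
    "Considered and decided by Einstein, Presiding Judge; Abel, Judge; and\n"
    "Lagrange, Gauss, Judge.\n"
)

for panel in extract_panels(sample):
    print(panel)
```

The line breaks are handled by the \s+ pieces, which match newlines on their own. Since the pattern contains no ., ^ or $, the DOTALL and MULTILINE flags do not change what it matches here.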