
Capturing possessives and prefixes with Python regex

Tags: python, regex

I'm trying to write a regex for Python to capture the various forms of "archipelago" that appear in a corpus.

Here is a test string:

This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'

I want to capture the following from the string:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Attempt 1

Using the regex (archipelag.*?)\b and testing with Pythex, I capture portions of all six forms. But there are problems:

  1. archipelago's is captured only as archipelago. I want to get the possessive.
  2. meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
  3. protoarchipelagic is captured only as archipelagic. I want to be able to capture non-hyphenated prefixes.
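The three problems above can be reproduced outside Pythex with a short sketch. The variable names are illustrative and not part of the original post; the pattern and test string are the ones quoted in the question.

```python
import re

corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Attempt 1: the lazy .*? stops at the first word boundary after the stem,
# and nothing in the pattern can reach back to pick up a prefix.
attempt_1 = re.findall(r"(archipelag.*?)\b", corpus)
print(attempt_1)
# ['archipelagos', 'archipelagic', 'archipelago', 'archipelagic', 'archipelagic', 'archipelagoes']
```

The apostrophe is a non-word character, so \b fires right before it and the possessive is cut off; the prefixes are lost because the match can only start at "archipelag".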

Attempt 2

If I try the regex (archipelag.*?)\s (see Pythex), the possessive archipelago's is now captured, but the comma that follows the first instance is captured as well (e.g., archipelagos,). It fails to capture the final 'archipelagoes.' altogether.
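A minimal sketch of this second attempt, again with illustrative variable names, shows the same behaviour in Python:

```python
import re

corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Attempt 2: the match must be followed by whitespace, so trailing
# punctuation is swallowed and a form at the very end of the string is missed.
attempt_2 = re.findall(r"(archipelag.*?)\s", corpus)
print(attempt_2)
# ['archipelagos,', 'archipelagic', "archipelago's", 'archipelagic', 'archipelagic']
```

The final 'archipelagoes.' is dropped because no whitespace follows it in the test string, and the prefixes are still not captured.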

Brian Croxall asked Nov 24 '25 20:11



1 Answer

The regular expression ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. If you have other requirements, you may want to modify it further.

Note the use of non-capturing groups, (?:...), to group subexpressions so that a trailing ? can make each group optional (match zero or one occurrence) without adding extra capture groups to the result.
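To see why the non-capturing form matters with findall, here is a small sketch comparing the two kinds of group. The shortened patterns and the sample text are simplified for illustration and are not the full expression from this answer.

```python
import re

text = "meta-archipelagic and archipelago's"

# With capturing groups, findall returns one tuple per match containing
# only the groups; a group that did not take part in the match is ''.
capturing = re.findall(r"(\w+-)?\w*archipelag\w*('s)?", text)
print(capturing)      # [('meta-', ''), ('', "'s")]

# With non-capturing groups there are no groups to report, so findall
# returns the whole matched text.
non_capturing = re.findall(r"(?:\w+-)?\w*archipelag\w*(?:'s)?", text)
print(non_capturing)  # ['meta-archipelagic', "archipelago's"]
```

Wrapping the whole expression in a single capturing group gives the same result as the second call, since findall then returns that one group for each match.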

import re

pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

for match in pat.findall(corpus):
    print(match)

prints

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

Here it is on regex101
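If the pattern needs to be modified further, laying it out with re.VERBOSE can make each piece easier to adjust. The sketch below is one possible variant, not a definitive version: the case-insensitive flag and the curly apostrophe are assumptions about what a real corpus might contain, and the names PATTERN and sample are hypothetical.

```python
import re

PATTERN = re.compile(
    r"""
    (?:\b\w+\b-)?           # optional hyphenated prefix such as "meta-"
    \b\w*archipelag\w*\b    # the stem, with any unhyphenated prefix or suffix
    (?:['’]s)?              # optional possessive; curly apostrophe is an assumption
    """,
    re.VERBOSE | re.IGNORECASE,  # IGNORECASE is also an assumption
)

sample = "Archipelagos matter. The archipelago’s cat and the meta-archipelagic turn."
found = PATTERN.findall(sample)
print(found)  # ['Archipelagos', 'archipelago’s', 'meta-archipelagic']
```

Because this variant has no capturing groups, findall returns the full text of each match.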

Patrick Haugh answered Nov 27 '25 09:11
