I'm trying to write a regex for Python to capture the various forms of "archipelago" that appear in a corpus.
Here is a test string:
This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'
I want to capture the following from the string:
archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
Using the regex (archipelag.*?)\b and testing with Pythex, I capture portions of all six forms. But there are problems:
archipelago's is captured only as archipelago. I want to get the possessive.
meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
protoarchipelagic is captured only as archipelagic. I want to be able to capture non-hyphenated prefixes.
If I try the regex (archipelag.*?)\s (see Pythex), the possessive archipelago's is now captured, but the comma that follows the first instance is captured as well (e.g., archipelagos,). It fails to capture the final 'archipelagoes.' altogether.
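To make the two attempts concrete, here is a small sketch that runs both patterns from the question against the test string with re.findall. The variable names (corpus, word_boundary, whitespace) are only illustrative and are not from the original post.

```python
import re

corpus = ("This is my sentence about islands, archipelagos, and archipelagic spaces. "
          "I want to make sure that the archipelago's cat is not forgotten. "
          "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
          "who tend to spell the plural 'archipelagoes.'")

# First attempt: the lazy match stops at the first word boundary, so the
# possessive 's and any prefix before "archipelag" are dropped.
word_boundary = re.findall(r"(archipelag.*?)\b", corpus)
print(word_boundary)
# ['archipelagos', 'archipelagic', 'archipelago', 'archipelagic', 'archipelagic', 'archipelagoes']

# Second attempt: the lazy match runs up to the next whitespace, so trailing
# punctuation is included and the final form (no whitespace after it) is missed.
whitespace = re.findall(r"(archipelag.*?)\s", corpus)
print(whitespace)
# ['archipelagos,', 'archipelagic', "archipelago's", 'archipelagic', 'archipelagic']
```

Both patterns start matching at the literal archipelag, which is why neither can ever include a prefix such as meta- or proto.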
The regular expression ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. If you have other requirements, you may want to modify it further.
Note the use of non-capturing groups, (?:...), to group sub-expressions so that a trailing ? can make each group optional (match zero or one occurrence) without adding extra capture groups to the result.
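Why the groups are non-capturing matters mostly for re.findall, which returns the captured groups rather than the whole match as soon as the pattern contains any. The following is a simplified sketch (shortened patterns and a made-up sample string, not the exact regex above) that shows the difference:

```python
import re

text = "meta-archipelagic and the archipelago's cat"

# With capturing groups, findall returns one tuple of groups per match,
# using '' for a group that did not take part in the match.
capturing = re.findall(r"(\w+-)?\w*archipelag\w*('s)?", text)
print(capturing)
# [('meta-', ''), ('', "'s")]

# With non-capturing groups and no other groups, findall returns the
# full text of each match.
non_capturing = re.findall(r"(?:\w+-)?\w*archipelag\w*(?:'s)?", text)
print(non_capturing)
# ['meta-archipelagic', "archipelago's"]
```

The regex in this answer wraps everything in a single outer capturing group, so findall returns that one group, which is the whole matched form.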
import re
pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")
corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"
for match in pat.findall(corpus):
    print(match)
prints
archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
Here it is on regex101
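If the goal is to tally the forms across a larger corpus, one possible extension is sketched below. It assumes you want case-insensitive matching (re.IGNORECASE is my addition, not part of the original pattern) and uses a made-up sample sentence; adjust both to your data.

```python
import re
from collections import Counter

# Same pattern as above; the IGNORECASE flag is an assumption so that
# capitalized forms such as "Archipelagos" are matched as well.
pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)", re.IGNORECASE)

# Hypothetical sample text, for illustration only.
sample = "Archipelagos and archipelagos; the archipelago's edge, a meta-archipelagic view."

# Lowercase each match so that differently capitalized forms are counted together.
counts = Counter(m.lower() for m in pat.findall(sample))
for form, n in counts.most_common():
    print(form, n)
```

For this sample, archipelagos is counted twice, and archipelago's and meta-archipelagic once each.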