
Python: Using Regex to Capture Sub-Patterns within a Pattern

Tags: python, regex

Disclaimer: This is my first post. Feel free to give me feedback and how I should or shouldn't have formatted this question. Thanks!

I'm looking to pull out data from text blocks by capturing anything that matches a pattern of a date format followed by a colon. I have successfully used regular expressions to capture information including an observation date, a colon, and any text that follows up to the period before the next date.

For example:
1999-01-01: 10 birds observed.

The problem is that some of my data contains site names, each followed by a colon, inside the observation text that comes after the observation date and its colon. This sub-pattern of 'sitename: data' could occur zero or many times within the block following the observation date.

For example:
1999-01-01: BS-001: 5 birds observed. All in good health. BS-002: 5 birds observed, some in poor health.

What pattern should I use to capture all text after the date format and colon, including the potential site names, their colons, and related data up to the period before the next observation date?

I currently extract the simple observation data (without multiple sites within them) by date and observation using the following pattern:

pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')  

The code above lets me pull out observation dates that could be in a variety of forms. Using periods as part of the pattern is tricky since observation data could be one or many sentences.

Here is an example of the text I am trying to search and split out. Each new match should begin with an observation date, so in the data below there should be 3 matches returned (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched. 2017-01-01: 23 individuals observed. Egg masses were not present. 2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.

Ideally the output would look like this:

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched.

2017-01-01: 23 individuals observed. Egg masses were not present.

2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.

MrChancey asked May 24 '26 09:05


2 Answers

Basically, it sounds like you want to separate your text into fields that start with a date and end just before a date or the end of the text. Here's one possibility:

\d{4}-\d\d-\d\d:           # date with colon
.*?                        # the minimal amount of any characters required to match
(?=                        # positive lookahead (match text but don't consume it)
   \d{4}-\d\d-\d\d:        # date with colon
  |                        # or
   $                       # end of text
)                          # end lookahead

Use it in conjunction with re.findall():

re.findall(r'\d{4}-\d\d-\d\d:.*?(?=\d{4}-\d\d-\d\d:|$)', mytext)

Run against your sample text above:

['2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
  Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk
  old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing
  in the masses were AMJE-like). BS-443: 3 egg masses observed in
  vernal pool habitat. A few egg masses may have been missed due to
  poor light conditions. Smith-019: 250 egg masses observed in
  vernal pool habitat. Observer searched only portions abutting the 
  road (SW margin of pool). Many AMJE masses observed attached
  to herbaceous vegetation and difficult to differentiate from
  one another. AMJE egg-mass count is a rough estimate within
  area searched. ',
 '2017-01-01: 23 individuals observed. Egg masses were not present. ',
 '2018-07-04: BS-440: All individuals took a break from breeding for
  the long holiday weekend.']
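
For reference, here is a minimal runnable sketch of the same idea, compiled with re.VERBOSE so the commented layout of the pattern can be kept as-is. The sample string is a shortened, made-up stand-in for the question's data, and the names RECORD, sample, and records are only illustrative. It assumes every date has the strict YYYY-MM-DD shape; re.DOTALL is added on the assumption that real input may contain line breaks.

```python
import re

# Verbose form of the pattern; the strict YYYY-MM-DD date shape is an assumption.
RECORD = re.compile(
    r"""
    \d{4}-\d\d-\d\d:          # date with colon starts a record
    .*?                       # as little text as possible
    (?=\d{4}-\d\d-\d\d:|$)    # stop just before the next date, or at the end
    """,
    re.VERBOSE | re.DOTALL,   # DOTALL lets '.' cross line breaks
)

# Shortened, made-up sample in the same shape as the question's data.
sample = (
    "2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
    "2017-01-01: 23 individuals observed. Egg masses were not present. "
    "2018-07-04: BS-440: All individuals took a break."
)

# Strip the trailing space that each match keeps before the next date.
records = [m.strip() for m in RECORD.findall(sample)]
for record in records:
    print(record)
```

Because the pattern has no capturing groups, re.findall() returns the whole match for each record. If your dates come in other shapes, both date sub-patterns would need to be loosened in the same way.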
glibdud answered May 26 '26 22:05


You can try replacing each run of whitespace that is followed by a date with two newline characters:

s = re.sub(r'\s+(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)

This way you don't match the first date at the beginning of the string.

If you are not sure that each date is preceded by whitespace, you can also write it like this:

s = re.sub(r'\s*(?!^)(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)
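
As a rough sketch of how the substitution can be turned into a list of records, the snippet below applies the `\s+` form and then splits on the inserted blank line. The sample string is a shortened, made-up stand-in for the question's data, and DATE_AHEAD, separated, and blocks are illustrative names only. It assumes the observation text itself never contains a blank line.

```python
import re

# Lookahead for the question's flexible date shape, followed by a colon.
DATE_AHEAD = r"(?=\d{4}-*\s*&*\d+-*\d*:)"

# Shortened, made-up sample in the same shape as the question's data.
sample = (
    "2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
    "2017-01-01: 23 individuals observed. Egg masses were not present. "
    "2018-07-04: BS-440: All individuals took a break."
)

# Replace the whitespace in front of each date with a blank line...
separated = re.sub(r"\s+" + DATE_AHEAD, "\n\n", sample)

# ...then split on that blank line to get one string per observation date.
blocks = separated.split("\n\n")
for block in blocks:
    print(block)
```

The sketch uses the `\s+` form because it only ever produces non-empty matches. The `\s*` variant can also match the empty string, and the way re.sub() treats an empty match next to a non-empty one changed in Python 3.7, so it is worth checking that variant's output on your own Python version.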
Casimir et Hippolyte answered May 26 '26 22:05