
How to extract headings in text file using regex in python?

I have always used Stack Overflow to solve my problems by searching existing threads. Today I would like some guidance on creating a regex pattern for my text files. The headings in my files vary and do not share a single naming pattern. The pattern they roughly follow looks like this:

2.0 DESCRIPTION
3.0 PLACE OF PERFORMANCE
5.0 SERVICES RETAINED
6.0        STRUCTURE AND ROLES
etc....

Each heading is a number, then one or more spaces, then capital letters. The output I need is a list:

output = ['2.0 DESCRIPTION','3.0 PLACE OF PERFORMANCE','5.0 SERVICES RETAINED','6.0        STRUCTURE AND ROLES']

I am extremely new to python and regex. I tried the following but it did not give me the output desired:

import re

text = '''2.0 DESCRIPTION 
some text here

3.0 SERVICES
som text

5.0 SERVICES RETAINED
some text

6.0        STRUCTURE AND ROLES
sometext'''

pattern = r"\d\s[A-Z][A-Z]+"
matches = re.findall(pattern,text)

But it returned:

['0 DESCRIPTION', '0 SERVICES', '0 SERVICES']

This is not the output I was looking for. Any guidance on finding the right pattern would be much appreciated.

Cheers, Abhishek

Asked by Abhishek, Mar 16 '26 14:03


1 Answer

You may use

matches = re.findall(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$', text, re.M)

Here,

  • ^ - start of a line (re.M makes ^ match at the start of every line, not only at the start of the string)
  • \d+(?:\.\d+)* - 1+ digits and then 0+ sequences of a . and 1+ digits
  • " *" (a space followed by *) - zero or more spaces
  • [A-Z][A-Z ]* - an uppercase letter and then 0 or more uppercase letters or spaces
  • $ - end of a line (with re.M, $ also matches right before each newline).
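
As a rough sketch, here is the pattern applied to sample text like the one in the question. The helper name extract_headings is made up for this illustration, and the rstrip() call is an extra step that assumes trailing spaces are not wanted. Since [A-Z ]* also matches spaces, a heading line that ends with a space would otherwise keep that space in the result.

```python
import re

# Sample text modelled on the question; the first heading deliberately
# ends with a trailing space to show how the pattern treats it.
text = "\n".join([
    "2.0 DESCRIPTION ",
    "some text here",
    "",
    "3.0 SERVICES",
    "som text",
    "",
    "5.0 SERVICES RETAINED",
    "some text",
    "",
    "6.0        STRUCTURE AND ROLES",
    "sometext",
])

# Same pattern as above, compiled once with re.M so that ^ and $
# anchor at the start and end of every line.
HEADING_RE = re.compile(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$', re.M)


def extract_headings(source):
    """Return the heading lines found in source (hypothetical helper)."""
    # findall returns whole matches because the only group is non-capturing.
    # rstrip() drops trailing spaces picked up by [A-Z ]* (an assumption
    # about the desired output, not part of the pattern itself).
    return [match.rstrip() for match in HEADING_RE.findall(source)]


print(extract_headings(text))
# ['2.0 DESCRIPTION', '3.0 SERVICES', '5.0 SERVICES RETAINED', '6.0        STRUCTURE AND ROLES']
```

The run of spaces inside '6.0        STRUCTURE AND ROLES' is kept as is, which matches the list requested in the question.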
Answered by Wiktor Stribiżew, Mar 18 '26 02:03


