How to extract headings in text file using regex in python?

Question

I have long used Stack Overflow to solve my problems by searching existing threads. Today I would like some guidance on creating a regex pattern for my text files. The headings in my files vary and do not share a naming pattern, but they roughly look like this:

2.0 DESCRIPTION
3.0 PLACE OF PERFORMANCE
5.0 SERVICES RETAINED
6.0        STRUCTURE AND ROLES
etc....

A heading is always a number followed by capital letters, or a number followed by several spaces and then capital letters. The output I need is a list:

output = ['2.0 DESCRIPTION','3.0 PLACE OF PERFORMANCE','5.0 SERVICES RETAINED','6.0        STRUCTURE AND ROLES']

I am extremely new to Python and regex. I tried the following, but it did not give me the desired output:

import re

text = f'''2.0 DESCRIPTION 
some text here

3.0 SERVICES
som text

5.0 SERVICES RETAINED
some text

6.0        STRUCTURE AND ROLES
sometext'''

pattern = r"\d\s[A-Z][A-Z]+"
matches = re.findall(pattern,text)

But it returned:

['0 DESCRIPTION', '0 SERVICES', '0 SERVICES']

Not the output that I was looking for. Your guidance in finding a pattern will be really appreciated.

Cheers, Abhishek

Wiktor Stribiżew · Accepted Answer

You may use

matches = re.findall(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$',text, re.M)

See the regex demo.

Here,

^ - start of a line (with re.M, ^ matches at the start of every line, not only at the start of the string)
\d+(?:\.\d+)* - 1+ digits and then 0+ sequences of a . and 1+ digits
" *" (a literal space followed by *) - zero or more spaces
[A-Z][A-Z ]* - an uppercase letter and then 0 or more uppercase letters or spaces
$ - end of a line.
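
As a minimal sketch of how this could be put together, the snippet below applies the pattern to the sample text from the question. The names HEADING_RE and extract_headings are hypothetical helpers introduced here for illustration, and the strip() call is an added assumption: because [A-Z ]* also accepts spaces, a heading with trailing spaces is matched along with them.

```python
import re

# Sample text from the question; the first heading ends with a trailing space.
text = "\n".join([
    "2.0 DESCRIPTION ",
    "some text here",
    "",
    "3.0 SERVICES",
    "som text",
    "",
    "5.0 SERVICES RETAINED",
    "some text",
    "",
    "6.0        STRUCTURE AND ROLES",
    "sometext",
])

# Hypothetical name: the pattern from the answer, compiled once with re.M
# so that ^ and $ match at the start and end of every line.
HEADING_RE = re.compile(r'^\d+(?:\.\d+)* *[A-Z][A-Z ]*$', re.M)


def extract_headings(source):
    """Return every line that looks like a numbered, upper-case heading."""
    # The pattern has no capturing groups, so findall returns whole matches.
    # strip() drops trailing spaces that [A-Z ]* may have picked up.
    return [match.strip() for match in HEADING_RE.findall(source)]


headings = extract_headings(text)
print(headings)
```

The spaces between the number and the title are kept as they appear in the file, so '6.0        STRUCTURE AND ROLES' comes back unchanged, which is what the question asks for.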

Tags:

regex

python-3.x