
Python RE. excluding some results

Tags:

python

regex

I'm new to RE and I'm trying to take song lyrics and isolate the verse titles, the backing vocals, and main vocals:

Here's an example of some lyrics:

[Intro]
D.A. got that dope!

[Chorus: Travis Scott]
Ice water, turned Atlantic (Freeze)
Nightcrawlin' in the Phantom (Skrrt, Skrrt)...

The verse titles include the square brackets and any words between them. They can be successfully isolated with

r'\[{1}.*?\]{1}'

The backing vocals are similar to the verse titles, but between (). They've been successfully isolated with:

r'\({1}.*?\){1}'

For the main vocals, I've used

r'\S+'

which does isolate the main_vocals, but also the verse titles and backing vocals. I cannot figure out how to isolate only the main vocals with simple REs.

Here's a Python script that produces the output I want, but I'd like to do it with REs (as a learning exercise) and cannot work out how from the documentation.

import re

file = 'D:/lyrics.txt'
with open(file, 'r') as f:
    lyrics = f.read()

def find_spans(pattern, string):
    pattern = re.compile(pattern)
    return [match.span() for match in pattern.finditer(string)]

verses = find_spans(r'\[{1}.*?\]{1}', lyrics)
backing_vocals = find_spans(r'\({1}.*?\){1}', lyrics)
main_vocals = find_spans(r'\S+', lyrics)

exclude = verses
exclude.extend(backing_vocals)

not_main_vocals = []
for span in exclude:
    start, stop = span
    not_main_vocals.extend(list(range(start, stop)))

main_vocals_temp = []
for span in main_vocals:
    append = True
    start, stop = span
    for i in range(start, stop):
        if i in not_main_vocals: 
            append = False
            continue
    if append == True: 
        main_vocals_temp.append(span)
main_vocals = main_vocals_temp
Osuynonma asked Dec 03 '25 03:12



1 Answer

Try this Demo:

pattern = r'(?P<Verse>\[[^\]]+])|(?P<Backing>\([^\)]+\))|(?P<Lyrics>[^\[\(]+)'

You can use re.finditer to isolate the groups.

breakdown = {k: [] for k in ('Verse', 'Backing', 'Lyrics')}
# pattern is a plain string, so call re.finditer; song holds the lyrics text
for p in re.finditer(pattern, song):
    for key, item in p.groupdict().items():
        if item: breakdown[key].append(item)

Result:

{
  'Verse': 
    [
      '[Intro]', 
      '[Chorus: Travis Scott]'
    ], 
  'Backing': 
    [
      '(Freeze)', 
      '(Skrrt, Skrrt)'
    ], 
  'Lyrics': 
    [
      '\nD.A. got that dope!\n\n', 
      '\nIce water, turned Atlantic ', 
      "\nNightcrawlin' in the Phantom ", 
      '...'
    ]
}

To elaborate on the pattern: it uses named groups to separate the three kinds of text. [^\]]+ means one or more characters that are not ], and likewise [^\)]+ means one or more characters that are not ). The Lyrics group, [^\[\(]+, matches any run of characters that contains neither [ nor (, so it stops wherever a verse title or a backing vocal begins. The demo on regex101 explains the components in more detail if you need it.
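As a rough sketch of how the named groups behave, the snippet below runs the same pattern over the sample lyrics from the question and labels each match with Match.lastgroup. The variable name song and the helper classify are assumptions made for illustration; they are not part of the original answer.

```python
import re

# Sample lyrics from the question; the name `song` is an assumption.
song = (
    "[Intro]\n"
    "D.A. got that dope!\n"
    "\n"
    "[Chorus: Travis Scott]\n"
    "Ice water, turned Atlantic (Freeze)\n"
    "Nightcrawlin' in the Phantom (Skrrt, Skrrt)..."
)

# Same three alternatives as the pattern above, compiled once.
pattern = re.compile(
    r"(?P<Verse>\[[^\]]+])"
    r"|(?P<Backing>\([^\)]+\))"
    r"|(?P<Lyrics>[^\[\(]+)"
)

def classify(text):
    """Hypothetical helper: return (group_name, matched_text) pairs in order."""
    # Exactly one named group takes part in each match, so lastgroup names it.
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

tokens = classify(song)
for name, value in tokens:
    print(f"{name:8} {value!r}")
```

Because the alternatives are tried left to right at each position, a [ or ( always starts a Verse or Backing match, and only the text in between falls through to Lyrics.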

If you don't want the newlines in the main lyrics, use (?P<Lyrics>[^\[\(\n]+) instead, which also excludes \n, to return the Lyrics without newlines:

'Lyrics': [
  'D.A. got that dope!', 
  'Ice water, turned Atlantic ',
  "Nightcrawlin' in the Phantom ", 
  '...'
]
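The script in the question collects (start, stop) spans of individual words rather than whole runs of text. If that is what you need, one possible approach (a sketch, not the only way) is to run \S+ inside each Lyrics match using the pos and endpos arguments of Pattern.finditer. The names TOKEN, WORD and main_vocal_spans are made up for this example.

```python
import re

# Verse titles and backing vocals are matched but left unnamed; only the
# remaining text is captured under the Lyrics group.
TOKEN = re.compile(r"\[[^\]]+]|\([^\)]+\)|(?P<Lyrics>[^\[\(]+)")
WORD = re.compile(r"\S+")

def main_vocal_spans(text):
    """Return (start, stop) spans of main-vocal words, indexed into text."""
    spans = []
    for m in TOKEN.finditer(text):
        if m.lastgroup == "Lyrics":
            start, end = m.span()
            # Searching with pos/endpos keeps the spans relative to the
            # full string, not to the Lyrics slice.
            spans.extend(w.span() for w in WORD.finditer(text, start, end))
    return spans

sample = "[Intro]\nHi there (yo)..."
spans = main_vocal_spans(sample)
print([sample[s:e] for s, e in spans])
```

Note one difference from the script in the question: text glued to a bracket, such as the trailing ... after (Skrrt, Skrrt), is kept here as its own span, whereas the original script drops any \S+ token that overlaps a bracketed section.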
r.ook answered Dec 05 '25 18:12



