I have a list of weather forecasts that start with a similar prefix that I'd like to remove. I'd also like to capture the city names:
Some examples:
If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will...
If you have vacation or wedding plans for Miami, Jacksonville, Macon, Charlotte, or Charleston, expect a couple systems...
If you have vacation or wedding plans in Pittsburgh, Philadelphia, Atlantic City, Newark, Baltimore, D.C., Richmond, Charleston, or Dover, expect the week...
The strings start with a common prefix "If you have vacation or wedding plans in" (one example uses "for" instead of "in"), and the last city has "or" before it. The list of cities is of variable length.
I've tried this:
>>> text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'
>>> re.search(r'^If you have vacation or wedding plans in ((\b\w+\b), ?)+ or (\w+)', text).groups()
('Providence,', 'Providence', 'Portland')
>>>
I think I'm pretty close, but obviously it's not working. I've never tried to do something with a variable number of captured items; any guidance would be greatly appreciated.
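In Python's re, a repeated capturing group only keeps its last repetition, which is why your first group comes back as 'Providence,' instead of every city. A common workaround is to capture the whole city list as a single group and split it afterwards. Here is a minimal sketch of that idea (variable names are mine, and it assumes the string ends with the city list, as in your sample text):

import re

text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'

# Capture everything after the prefix as one group; the examples use
# both "in" and "for", so accept either.
m = re.search(r'^If you have vacation or wedding plans (?:in|for) (.+)', text)
if m:
    cities = [c.strip() for c in m.group(1).split(',') if c.strip()]
    cities[-1] = re.sub(r'^or\s+', '', cities[-1])  # strip the leading "or"
    print(cities)
    # ['NYC', 'Boston', 'Manchester', 'Concord', 'Providence', 'Portland']

If you really need every repetition of a single repeated group, the third-party regex module exposes them via match.captures(group); the standard library's re does not.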
An alternative solution (probably just for sharing and educational purposes): if you were to solve it with nltk, it would be called a Named Entity Recognition problem. Using the snippet based on nltk.chunk.ne_chunk_sents() provided here:
import nltk

def extract_entity_names(t):
    """Recursively collect named entities from an nltk.Tree."""
    entity_names = []
    # Tree nodes have a label; leaf tokens are plain (word, tag) tuples.
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            # A chunk labelled NE is one entity; join its word tokens.
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))
    return entity_names

sample = "If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will..."

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# binary=True labels chunks simply as NE instead of PERSON/GPE/etc.
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))

print(entity_names)
It prints exactly the desired result:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
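Note that this pipeline depends on a few NLTK corpora and models being installed; if you hit a LookupError, a one-time download along these lines usually fixes it (resource names are current for recent NLTK releases and may differ in yours):

import nltk

# Resources used by sent_tokenize/word_tokenize, pos_tag, and ne_chunk_sents.
for resource in ('punkt', 'averaged_perceptron_tagger',
                 'maxent_ne_chunker', 'words'):
    nltk.download(resource)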
Here is my approach: use the csv module to parse the lines (I assume they are in a text file named data.csv; change to suit your situation). After parsing each line: drop the trailing forecast clause, strip the common prefix from the first field, and remove the "or " before the last city.
Here is the code:
import csv

def cleanup(row):
    # Drop the trailing clause (e.g. "the week will..."), which csv
    # parses as the last field.
    new_row = row[:-1]
    # Strip the common prefix; the examples use both "in" and "for".
    new_row[0] = new_row[0].replace('If you have vacation or wedding plans in ', '')
    new_row[0] = new_row[0].replace('If you have vacation or wedding plans for ', '')
    # The last city still carries the leading "or ".
    new_row[-1] = new_row[-1].replace('or ', '')
    return new_row

if __name__ == '__main__':
    with open('data.csv') as f:
        reader = csv.reader(f, skipinitialspace=True)
        for row in reader:
            row = cleanup(row)
            print(row)
Output:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
['Pittsburgh', 'Philadelphia', 'Atlantic City', 'Newark', 'Baltimore', 'D.C.', 'Richmond', 'Charleston', 'Dover']
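One caveat: cleanup() assumes every line ends with a trailing forecast clause, so row[:-1] would eat the last city on a line that stops at the city list (like the sample text in the question). A slightly more defensive sketch (cleanup_defensive is a hypothetical name, not part of the code above) cuts the row at the field that starts with "or " instead:

import re

PREFIX = re.compile(r'^If you have vacation or wedding plans (?:in|for) ')

def cleanup_defensive(row):
    # Strip the common prefix from the first field, accepting "in" or "for".
    row = [field.strip() for field in row]
    row[0] = PREFIX.sub('', row[0])
    # The last city is the field that starts with "or "; drop anything after it.
    for i, field in enumerate(row):
        if field.startswith('or '):
            return row[:i] + [field[3:]]
    return row

This works on rows from csv.reader whether or not a trailing clause is present, as long as the last city carries the leading "or ".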