Python regex to capture a comma-delimited list of items

Tags: python, regex

I have a list of weather forecasts that start with a similar prefix that I'd like to remove. I'd also like to capture the city names:

Some Examples:

If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will...

If you have vacation or wedding plans for Miami, Jacksonville, Macon, Charlotte, or Charleston, expect a couple systems...

If you have vacation or wedding plans in Pittsburgh, Philadelphia, Atlantic City, Newark, Baltimore, D.C., Richmond, Charleston, or Dover, expect the week...

The strings start with a common prefix, "If you have vacation or wedding plans in" (or "for" in some cases), and the last city has "or" before it. The list of cities is of variable length.

I've tried this:

>>> import re
>>> text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'
>>> re.search(r'^If you have vacation or wedding plans in ((\b\w+\b), ?)+ or (\w+)', text).groups()
('Providence,', 'Providence', 'Portland')

I think I'm pretty close, but as the output shows, only the last city matched by the repeated group and the final city are captured. I've never tried to do something with a variable number of captured items; any guidance would be greatly appreciated.
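
For context, the output above is expected regex behaviour: a capturing group inside a repetition keeps only the text from its last iteration, so ((\b\w+\b), ?)+ reports just 'Providence'. One common workaround is to capture the whole list as a single string and split it afterwards. The sketch below assumes every list has at least one "City, " item followed by "or <last city>", and that the prefix uses "in" or "for" as in the examples; the names LIST_RE and extract_cities are made up for illustration.

```python
import re

# A repeated group only remembers its final iteration:
# re.search(r'(\w)+', 'abc').group(1) returns 'c'.

# Hypothetical pattern: capture the comma-separated run as ONE group,
# then the final city after "or ".
LIST_RE = re.compile(
    r'^If you have vacation or wedding plans (?:in|for) '
    r'((?:[^,]+, )+?)'  # one or more "City, " items, as few as possible
    r'or ([^,]+)'       # last city: everything up to the next comma or the end
)


def extract_cities(text):
    """Return the list of city names, or [] if the text does not match."""
    match = LIST_RE.match(text)
    if match is None:
        return []
    # group(1) ends with ', ', so drop the empty piece that split() produces
    cities = [city for city in match.group(1).split(', ') if city]
    cities.append(match.group(2))
    return cities


if __name__ == '__main__':
    text = ('If you have vacation or wedding plans in NYC, Boston, '
            'Manchester, Concord, Providence, or Portland')
    print(extract_cities(text))
```

Using [^,]+ rather than \w+ lets multi-word names such as "Salt Lake City" and abbreviations such as "D.C." through. The lazy +? stops at the first item that begins with "or ", so a later ", or" in the rest of the sentence is not mistaken for the end of the list.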

Scott asked Oct 31 '22 14:10


2 Answers

Here is an alternative solution (mostly for sharing and educational purposes).

If you were to solve it with nltk, it would be treated as a Named Entity Recognition problem. Here is a snippet based on nltk.chunk.ne_chunk_sents():

import nltk


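def extract_entity_names(t):
    """Recursively collect the text of every 'NE' subtree in a chunk tree."""
    entity_names = []

    # Only Tree nodes have a label; leaf (word, tag) tuples are skipped
    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            # Join the words of the entity, e.g. 'Salt' + 'Lake' + 'City'
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names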


sample = "If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will..."

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))

print(entity_names)

It prints exactly the desired result:

['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
alecxe answered Nov 08 '22 17:11


Here is my approach: use the csv module to parse the lines (I assume they are in a text file named data.csv; please change this to suit your situation). After parsing each line:

  1. Discard the last cell; it is not a city name
  2. Remove 'If ...' from the first cell
  3. Remove 'or ' from the last cell (which used to be next-to-last)

Here is the code:

import csv


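def cleanup(row):
    # Drop the trailing cell: it holds the rest of the sentence, not a city
    new_row = row[:-1]
    # Strip the common prefix (the "in" or "for" variant) from the first cell
    new_row[0] = new_row[0].replace('If you have vacation or wedding plans in ', '')
    new_row[0] = new_row[0].replace('If you have vacation or wedding plans for ', '')
    # Strip only a leading 'or ' so names that contain "or " stay intact
    if new_row[-1].startswith('or '):
        new_row[-1] = new_row[-1][len('or '):]
    return new_row

if __name__ == '__main__':
    # newline='' is what the csv docs recommend when opening files in Python 3
    with open('data.csv', newline='') as f:
        reader = csv.reader(f, skipinitialspace=True)
        for row in reader:
            row = cleanup(row)
            print(row)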

Output:

['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
['Pittsburgh', 'Philadelphia', 'Atlantic City', 'Newark', 'Baltimore', 'D.C.', 'Richmond', 'Charleston', 'Dover']
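
To try this approach without creating data.csv, note that csv.reader accepts any iterable of strings, so a single line can be parsed in memory. This is only a sketch of the same three steps: the helper name cities_from_line and the PREFIXES tuple are invented for the example, and it assumes each line still has some text after the last city (otherwise the last city itself would be discarded).

```python
import csv

# Assumed prefixes, taken from the examples in the question
PREFIXES = (
    'If you have vacation or wedding plans in ',
    'If you have vacation or wedding plans for ',
)


def cities_from_line(line):
    """Apply the discard/strip steps to one forecast line held in memory."""
    # Wrap the line in a list so csv.reader can iterate over it
    row = next(csv.reader([line], skipinitialspace=True))
    cells = row[:-1]  # step 1: drop the trailing non-city cell
    if not cells:
        return []
    # Step 2: remove the leading prefix from the first cell
    for prefix in PREFIXES:
        if cells[0].startswith(prefix):
            cells[0] = cells[0][len(prefix):]
            break
    # Step 3: remove the leading 'or ' from the last remaining cell
    if cells[-1].startswith('or '):
        cells[-1] = cells[-1][len('or '):]
    return cells


if __name__ == '__main__':
    line = ('If you have vacation or wedding plans for Miami, Jacksonville, '
            'Macon, Charlotte, or Charleston, expect a couple systems...')
    print(cities_from_line(line))
```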
Hai Vu answered Nov 08 '22 17:11