Matching states and cities with possibly multiple words

Question

I have a Python list like the following elements:

['Alabama[edit]',
 'Auburn (Auburn University)[1]',
 'Florence (University of North Alabama)',
 'Jacksonville (Jacksonville State University)[2]',
 'Livingston (University of West Alabama)[2]',
 'Montevallo (University of Montevallo)[2]',
 'Troy (Troy University)[2]',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
 'Tuskegee (Tuskegee University)[5]',
 'Alaska[edit]',
 'Fairbanks (University of Alaska Fairbanks)[2]',
 'Arizona[edit]',
 'Flagstaff (Northern Arizona University)[6]',
 'Tempe (Arizona State University)',
 'Tucson (University of Arizona)',
 'Arkansas[edit]',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
 'Fayetteville (University of Arkansas)[7]']

The list is not complete, but is sufficient to give you an idea of what's in it.

The data is structured like this:

There is a name of a US state and following the state name, there are some names of cities IN THAT STATE. The state name, as you can see ends in "[edit]", and the cities' name either end in a bracket with a number (for example "1", or "[2]"), or with a university's name within parenthesis (for example "(University of North Alabama)").

(Find the full reference file for this problem here)

I ideally want a Python dictionary with the state names as the index, and all the cities' names in that state in a nested listed as a value to that particular index. So, for example the dictionary should be like:

{'Alabama': ['Auburn', 'Florence', 'Jacksonville'...], 'Arizona': ['Flagstaff', 'Temple', 'Tucson', ....], ......}

Now, I tried the following solution, to weed out the unnecessary parts:

import numpy as np
import pandas as pd

    def get_list_of_university_towns():
        '''
        Returns a DataFrame of towns and the states they are in from the 
        university_towns.txt list. The format of the DataFrame should be:
        DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ], 
        columns=["State", "RegionName"]  )

        The following cleaning needs to be done:

        1. For "State", removing characters from "[" to the end.
        2. For "RegionName", when applicable, removing every character from " (" to the end.
        3. Depending on how you read the data, you may need to remove newline character '
'. 

        '''

        fhandle = open("university_towns.txt")
        ftext = fhandle.read().split("
")

        reftext = list()
        for item in ftext:
            reftext.append(item.split(" ")[0])

        #pos = reftext[0].find("[")
        #reftext[0] = reftext[0][:pos]

        towns = list()
        dic = dict()

        for item in reftext:
            if item == "Alabama[edit]":
                state = "Alabama"

            elif item.endswith("[edit]"):
                dic[state] = towns
                towns = list()
                pos = item.find("[")
                item = item[:pos]
                state = item

            else:
                towns.append(item)

        return ftext

    get_list_of_university_towns()

A snippet of my output generated by my code looks like this:

{'Alabama': ['Auburn',
  'Florence',
  'Jacksonville',
  'Livingston',
  'Montevallo',
  'Troy',
  'Tuscaloosa',
  'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
 'Arkansas': ['Arkadelphia',
  'Conway',
  'Fayetteville',
  'Jonesboro',
  'Magnolia',
  'Monticello',
  'Russellville',
  'Searcy'],
 'California': ['Angwin',
  'Arcata',
  'Berkeley',
  'Chico',
  'Claremont',
  'Cotati',
  'Davis',
  'Irvine',
  'Isla',
  'University',
  'Merced',
  'Orange',
  'Palo',
  'Pomona',
  'Redlands',
  'Riverside',
  'Sacramento',
  'University',
  'San',
  'San',
  'Santa',
  'Santa',
  'Turlock',
  'Westwood,',
  'Whittier'],
 'Colorado': ['Alamosa',
  'Boulder',
  'Durango',
  'Fort',
  'Golden',
  'Grand',
  'Greeley',
  'Gunnison',
  'Pueblo,'],
 'Connecticut': ['Fairfield',
  'Middletown',
  'New',
  'New',
  'New',
  'Storrs',
  'Willimantic'],
 'Delaware': ['Dover', 'Newark'],
 'Florida': ['Ave',
  'Boca',
  'Coral',
  'DeLand',
  'Estero',
  'Gainesville',
  'Orlando',
  'Sarasota',
  'St.',
  'St.',
  'Tallahassee',
  'Tampa'],
 'Georgia': ['Albany',
  'Athens',
  'Atlanta',
  'Carrollton',
  'Demorest',
  'Fort',
  'Kennesaw',
  'Milledgeville',
  'Mount',
  'Oxford',
  'Rome',
  'Savannah',
  'Statesboro',
  'Valdosta',
  'Waleska',
  'Young'],
 'Hawaii': ['Manoa'],

But, there is one error in the output: States with a space in their names (e.g. "North Carolina") are not included. I can the the reason behind it.

I thought of using regular expressions, but since I have yet to study about them, I do not know how to form one. Any ideas as to how it could be done with or without the use of Regex?

Jan · Accepted Answer

Praise the power of regular expressions then:

states_rx = re.compile(r'''
^
(?P<state>.+?)$$edit$$
(?P<cities>[\s\S]+?)
(?=^.*$$edit$$$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

transformed = '\n'.join(lst_)

result = {state.group('state'): [city.group(0).rstrip() 
        for city in cities_rx.finditer(state.group('cities'))] 
        for state in states_rx.finditer(transformed)}
print(result)

This yields

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 'Alaska': ['Fairbanks'], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'], 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']}

Explanation:

The idea is to split the task up into several smaller tasks:

Join the complete list with \n
Separate states
Separate towns
Use a dict comprehension for all found items

First subtask

transformed = '\n'.join(your_list)

Second subtask

^                      # match start of the line
(?P<state>.+?)$$edit$$ # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)   # afterwards match anything up to
(?=^.*$$edit$$$|\Z)    # ... either another state or the very end of the string

See the demo on regex101.com.

Third subtask

^[^()\n]+              # match start of the line, anything not a newline character or ( or )

See another demo on regex101.com.

Fourth subtask

result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}

This is roughly equivalent to:

for state in states_rx.finditer(transformed):
    # state is in state.group('state')
    for city in cities_rx.finditer(state.group('cities')):
        # city is in city.group(0), possibly with whitespaces
        # hence the rstrip

Lastly, some timing issues:

import timeit
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965

So running the above a 100.000 times took me round 12 seconds on my computer, so it should be reasonably fast.

Matching states and cities with possibly multiple words

Tags:

python

regex

python-3.x

CuriousLearner

1 Answers

Explanation:

Jan

Recent Activity

Donate For Us

Matching states and cities with possibly multiple words

Tags:

python

regex

python-3.x

CuriousLearner

1 Answers

Explanation:

Jan

Related questions

Recent Activity

Donate For Us