I have a Python list with elements like the following:
['Alabama[edit]',
 'Auburn (Auburn University)[1]',
 'Florence (University of North Alabama)',
 'Jacksonville (Jacksonville State University)[2]',
 'Livingston (University of West Alabama)[2]',
 'Montevallo (University of Montevallo)[2]',
 'Troy (Troy University)[2]',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
 'Tuskegee (Tuskegee University)[5]',
 'Alaska[edit]',
 'Fairbanks (University of Alaska Fairbanks)[2]',
 'Arizona[edit]',
 'Flagstaff (Northern Arizona University)[6]',
 'Tempe (Arizona State University)',
 'Tucson (University of Arizona)',
 'Arkansas[edit]',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
 'Fayetteville (University of Arkansas)[7]']
The list is not complete, but is sufficient to give you an idea of what's in it.
The data is structured like this:
Each US state name is followed by the names of some cities in that state. As you can see, a state name ends in "[edit]", while a city name is followed by one or more university names in parentheses (for example "(University of North Alabama)") and, in many cases, by a bracketed number (for example "[1]" or "[2]").
(Find the full reference file for this problem here)
Ideally, I want a Python dictionary with the state names as keys and, as the value for each key, a list of all the city names in that state. For example, the dictionary should look like:
{'Alabama': ['Auburn', 'Florence', 'Jacksonville', ...], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson', ...], ...}
Now, I tried the following solution, to weed out the unnecessary parts:
import numpy as np
import pandas as pd
def get_list_of_university_towns():
    '''
    Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"]  )
    The following cleaning needs to be done:
    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'.
    '''
    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")
    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])
    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]
    towns = list()
    dic = dict()
    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"
        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item
        else:
            towns.append(item)
    return ftext
get_list_of_university_towns()
A snippet of the output generated by my code looks like this:
{'Alabama': ['Auburn',
  'Florence',
  'Jacksonville',
  'Livingston',
  'Montevallo',
  'Troy',
  'Tuscaloosa',
  'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
 'Arkansas': ['Arkadelphia',
  'Conway',
  'Fayetteville',
  'Jonesboro',
  'Magnolia',
  'Monticello',
  'Russellville',
  'Searcy'],
 'California': ['Angwin',
  'Arcata',
  'Berkeley',
  'Chico',
  'Claremont',
  'Cotati',
  'Davis',
  'Irvine',
  'Isla',
  'University',
  'Merced',
  'Orange',
  'Palo',
  'Pomona',
  'Redlands',
  'Riverside',
  'Sacramento',
  'University',
  'San',
  'San',
  'Santa',
  'Santa',
  'Turlock',
  'Westwood,',
  'Whittier'],
 'Colorado': ['Alamosa',
  'Boulder',
  'Durango',
  'Fort',
  'Golden',
  'Grand',
  'Greeley',
  'Gunnison',
  'Pueblo,'],
 'Connecticut': ['Fairfield',
  'Middletown',
  'New',
  'New',
  'New',
  'Storrs',
  'Willimantic'],
 'Delaware': ['Dover', 'Newark'],
 'Florida': ['Ave',
  'Boca',
  'Coral',
  'DeLand',
  'Estero',
  'Gainesville',
  'Orlando',
  'Sarasota',
  'St.',
  'St.',
  'Tallahassee',
  'Tampa'],
 'Georgia': ['Albany',
  'Athens',
  'Atlanta',
  'Carrollton',
  'Demorest',
  'Fort',
  'Kennesaw',
  'Milledgeville',
  'Mount',
  'Oxford',
  'Rome',
  'Savannah',
  'Statesboro',
  'Valdosta',
  'Waleska',
  'Young'],
 'Hawaii': ['Manoa'],
But there is one error in the output: states with a space in their names (e.g. "North Carolina") are not included. I can see the reason behind it.
I thought of using regular expressions, but since I have yet to study them, I do not know how to form one. Any ideas as to how this could be done, with or without regex?
Praise the power of regular expressions then:
import re

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)
cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)
transformed = '\n'.join(lst_)  # lst_ is the list from the question
result = {state.group('state'): [city.group(0).rstrip() 
        for city in cities_rx.finditer(state.group('cities'))] 
        for state in states_rx.finditer(transformed)}
print(result)
This yields
{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 'Alaska': ['Fairbanks'], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'], 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']}
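The idea is to split the task up into several smaller tasks: join the list with \n, separate the states, separate the cities, and combine everything in a dict comprehension.
First subtask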
transformed = '\n'.join(your_list)
Second subtask
^                      # match start of the line
(?P<state>.+?)\[edit\] # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)   # afterwards match anything up to
(?=^.*\[edit\]$|\Z)    # ... either another state or the very end of the string
See the demo on regex101.com.
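To see what the second subtask does with a state name that contains a space, here is a minimal sketch. It reuses the states_rx pattern from above; the sample list is made up for illustration and is not taken from the reference file.

```python
import re

# Same pattern as in the answer above.
states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

# Hypothetical sample, shaped like the list in the question.
sample = '\n'.join([
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
    'North Carolina[edit]',
    'Chapel Hill (University of North Carolina at Chapel Hill)',
    'Durham (Duke University)',
])

# Each match covers one state header plus the city lines below it.
state_names = [m.group('state') for m in states_rx.finditer(sample)]
print(state_names)  # ['Alaska', 'North Carolina']
```

Because the state group captures everything on the line before "[edit]", multi-word names are kept whole.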
Third subtask
^[^()\n]+              # match start of the line, anything not a newline character or ( or )
See another demo on regex101.com.
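The third subtask can be tried on its own in the same way. This is a small sketch with a made-up block of city lines, shaped like what the cities group captures (including the leading newline).

```python
import re

cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Hypothetical block of city lines, as captured by the "cities" group.
block = '\n'.join([
    '',
    'Chapel Hill (University of North Carolina at Chapel Hill)',
    'Durham (Duke University)[2]',
    'San Luis Obispo (California Polytechnic State University)[2]',
])

# The match stops at the first "(", so the trailing space is stripped afterwards.
cities = [m.group(0).rstrip() for m in cities_rx.finditer(block)]
print(cities)  # ['Chapel Hill', 'Durham', 'San Luis Obispo']
```

Note that the pattern only stops at a parenthesis or a newline, so a line without parentheses would be returned whole, including any trailing "[n]" marker.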
Fourth subtask
result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}
This is roughly equivalent to:
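result = {}
for state in states_rx.finditer(transformed):
    # state is in state.group('state')
    cities = []
    for city in cities_rx.finditer(state.group('cities')):
        # city is in city.group(0), possibly with trailing whitespace
        # hence the rstrip
        cities.append(city.group(0).rstrip())
    result[state.group('state')] = cities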
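The timing call further down uses a function named findstatesandcities whose definition is not shown. One plausible shape for it, as a sketch only, is a wrapper around the dict comprehension; the shortened lst_ below is a stand-in for the full list from the question.

```python
import re

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)
cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Shortened stand-in for the list from the question.
lst_ = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]
transformed = '\n'.join(lst_)

def findstatesandcities():
    """Assumed wrapper: build the state -> cities dict from the joined text."""
    return {state.group('state'): [city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))]
            for state in states_rx.finditer(transformed)}

print(findstatesandcities())
# {'Alabama': ['Auburn', 'Florence'], 'Alaska': ['Fairbanks']}
```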
import timeit
# findstatesandcities is assumed to be a function that wraps the dict comprehension above
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965
So running the above 100,000 times took around 12 seconds on my computer, so it should be reasonably fast.
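Since the question also asks how this could be done without regular expressions, here is a rough alternative using only string methods. This is a different technique from the regex approach above; the function name towns_by_state and the sample list are made up for illustration.

```python
def towns_by_state(lines):
    """Group town names under the most recent '[edit]' state header."""
    result = {}
    towns = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith('[edit]'):
            # New state header: drop the "[edit]" suffix and start a fresh list.
            towns = result.setdefault(line[:-len('[edit]')], [])
        elif towns is not None:
            # Keep the text before " (", then cut at "[" for lines without parentheses.
            towns.append(line.split(' (')[0].split('[')[0].strip())
    return result

# Hypothetical sample, shaped like the list in the question.
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'North Carolina[edit]',
    'Chapel Hill (University of North Carolina at Chapel Hill)[2]',
]
print(towns_by_state(sample))
# {'Alabama': ['Auburn', 'Florence'], 'North Carolina': ['Chapel Hill']}
```

Splitting on " (" rather than on a single space is what keeps multi-word names such as "North Carolina" and "Chapel Hill" intact.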