Problem: I have an address field from an Access database that has been converted to SQL Server 2005. The entire address is stored in this single field. I need to parse out the address's individual sections into their appropriate fields in a normalized table. I need to do this for approximately 4,000 records, and it needs to be repeatable.
Assumptions:
Assume an address in the US (for now)
Assume that the input string will sometimes contain an addressee (the person being addressed) and/or a second street address line (e.g. Suite B)
States may be abbreviated
The zip code could be the standard 5 digits or zip+4
There are typos in some instances
UPDATE: In response to the questions posed: standards were not universally followed; I need to store the individual values, not just geocode; and "errors" means typos (corrected above).
Sample Data:
A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
11522 Shawnee Road, Greenwood DE 19950
144 Kings Highway, S.W. Dover, DE 19901
Intergrated Const. Services 2 Penns Way Suite 405 New Castle, DE 19720
Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
Nichols Excavation 2742 Pulaski Hwy Newark, DE 19711
2284 Bryn Zion Road, Smyrna, DE 19904
VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
580 North Dupont Highway Dover, DE 19901
P.O. Box 778 Dover, DE 19903
The easiest way to parse an address is to apply a regular expression. This method works best when the addresses follow a regular form. For example, if all the address strings look like STREET_NAME XX, YYYYYY CITY_NAME, you can write a regex that splits each string into [STREET_NAME, XX, YYYYYY, CITY_NAME].
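As a minimal sketch of that idea applied to the sample data, the pattern below peels the state and ZIP off the end of the string and, when a comma is present, treats the text after the last comma as the city. The pattern, the function name `split_last_line`, and the returned keys are my own assumptions, not part of the answer. It expects an uppercase two-letter state followed by a 5-digit or ZIP+4 code.

```python
import re

# Assumed shape of the tail: "<anything>, ST 12345" or "<anything> ST 12345-6789".
LAST_LINE = re.compile(
    r"^(?P<rest>.*?)[,\s]+(?P<state>[A-Z]{2})[,\s]+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)

def split_last_line(address):
    """Return street, city, state and zip, or None if the tail is irregular."""
    m = LAST_LINE.match(address.strip())
    if m is None:
        return None  # e.g. a truncated ZIP: leave the row for manual review
    parts = m.groupdict()
    # Text after the last comma is taken as the city; without a comma the
    # city cannot be separated from the street by this pattern alone.
    street, sep, city = parts.pop("rest").rpartition(",")
    if sep:
        parts["street"], parts["city"] = street.strip(), city.strip()
    else:
        parts["street"], parts["city"] = city.strip(), None
    return parts

print(split_last_line("2284 Bryn Zion Road, Smyrna, DE 19904"))
print(split_last_line("VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21"))
```

Rows that return None, or that come back with no city, are the ones that do not follow the regular form and need another approach or a manual look. A row like "144 Kings Highway, S.W. Dover, DE 19901" also shows the limit of this approach: the last comma does not fall between street and city there.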
Use the Parse Address tool to break down any street address into its component parts, such as street number, street name, suffix, and others. A single parse candidate is returned. You can also parse last-line data containing the city, state, and ZIP code.
I've done a lot of work on this kind of parsing. Because there are errors you won't get 100% accuracy, but there are a few things you can do to get most of the way there, and then do a visual BS test. Here's the general way to go about it. It's not code, because writing it is pretty academic: there's no weirdness, just lots of string handling.
(Now that you've posted some sample data, I've made some minor changes)
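To give a flavor of the string handling involved, here is a rough sketch of one piece: separating an optional addressee and a trailing unit (such as "Suite 405") from the street. This is my own illustration, not the answer's method. It assumes the city, state and ZIP have already been removed from the end, that the street starts at the first house number or "P.O. Box", and that units use one of a few keywords. The names `split_front`, `STREET_START` and `UNIT` are hypothetical.

```python
import re

# Assumption: the street begins at the first standalone number or at "P.O. Box".
STREET_START = re.compile(r"(?<!\w)(?:P\.?\s*O\.?\s+Box|\d+)\b", re.IGNORECASE)
# Assumption: a secondary unit, if present, ends the street text.
UNIT = re.compile(r",?\s*\b((?:Suite|Ste|Apt|Unit)\.?\s+\w+)\s*$", re.IGNORECASE)

def split_front(front):
    """Split the text preceding the city into addressee, street and unit."""
    front = front.strip().rstrip(",")
    m = STREET_START.search(front)
    if m is None:
        # No recognizable street start: keep everything as the street.
        addressee, street = None, front
    else:
        addressee = front[:m.start()].strip(" ,") or None
        street = front[m.start():].strip()
    unit = None
    u = UNIT.search(street)
    if u:
        unit = u.group(1)
        street = street[:u.start()].strip()
    return {"addressee": addressee, "street": street, "unit": unit}

print(split_front("Intergrated Const. Services 2 Penns Way Suite 405"))
print(split_front("P.O. Box 778"))
```

An addressee whose name contains a standalone number would be split in the wrong place, which is one reason a visual check of the results is still needed.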
I hope this helps somewhat.
I think outsourcing the problem is the best bet: send it to the Google (or Yahoo) geocoder. The geocoder returns not only the lat/long (which aren't of interest here), but also a rich parsing of the address, with fields filled in that you didn't send (including ZIP+4 and county).
For example, parsing "1600 Amphitheatre Parkway, Mountain View, CA" yields
{
  "name": "1600 Amphitheatre Parkway, Mountain View, CA, USA",
  "Status": { "code": 200, "request": "geocode" },
  "Placemark": [
    {
      "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
      "AddressDetails": {
        "Country": {
          "CountryNameCode": "US",
          "AdministrativeArea": {
            "AdministrativeAreaName": "CA",
            "SubAdministrativeArea": {
              "SubAdministrativeAreaName": "Santa Clara",
              "Locality": {
                "LocalityName": "Mountain View",
                "Thoroughfare": { "ThoroughfareName": "1600 Amphitheatre Pkwy" },
                "PostalCode": { "PostalCodeNumber": "94043" }
              }
            }
          }
        },
        "Accuracy": 8
      },
      "Point": { "coordinates": [-122.083739, 37.423021, 0] }
    }
  ]
}
Now that's parseable!
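For instance, a response shaped like the one quoted above could be flattened into the individual columns with a few lines of Python. This is only a sketch: it works on a trimmed copy of that response rather than calling any service, the output field names are my own, and the recursive lookup helper `find_key` is a hypothetical convenience I chose so that the code does not depend on every nesting level being present.

```python
import json

# Trimmed copy of the response quoted above (status and coordinates omitted).
SAMPLE = """
{"Placemark": [{"address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
  "AddressDetails": {"Country": {"CountryNameCode": "US",
    "AdministrativeArea": {"AdministrativeAreaName": "CA",
      "SubAdministrativeArea": {"SubAdministrativeAreaName": "Santa Clara",
        "Locality": {"LocalityName": "Mountain View",
          "Thoroughfare": {"ThoroughfareName": "1600 Amphitheatre Pkwy"},
          "PostalCode": {"PostalCodeNumber": "94043"}}}}},
    "Accuracy": 8}}]}
"""

def find_key(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

def extract_fields(response):
    """Pull the address parts out of the first placemark."""
    details = response["Placemark"][0]["AddressDetails"]
    return {
        "street": find_key(details, "ThoroughfareName"),
        "city": find_key(details, "LocalityName"),
        "county": find_key(details, "SubAdministrativeAreaName"),
        "state": find_key(details, "AdministrativeAreaName"),
        "zip": find_key(details, "PostalCodeNumber"),
    }

print(extract_fields(json.loads(SAMPLE)))
```

Any field the geocoder did not return comes back as None, so incomplete rows are easy to spot before loading them into the normalized table.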