Where is a good Address Parser [closed]

People also ask

What does it mean to parse an address?

Parsing is the process of segmenting an address into components like house number, street name, city, etc. Though many address parsers have been written over the years, most are rule-based and only designed to handle US addresses.

How do you parse a street address?

The easiest way to parse an address is to apply a regular expression. This method works well when all your addresses share a regular form. For example, if all the address strings look like STREET_NAME XX, YYYYYY CITY_NAME, you can write a regex that splits each string into [STREET_NAME, XX, YYYYYY, CITY_NAME].
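
As a rough sketch of that idea, assuming every address really does follow the STREET_NAME XX, YYYYYY CITY_NAME layout (the pattern, the function name parse_regular_address, and the sample address are made up for illustration):

```python
import re

# Assumed layout: STREET_NAME XX, YYYYYY CITY_NAME
# e.g. "Hauptstrasse 12, 10115 Berlin" (made-up sample, not from a real dataset)
ADDRESS_RE = re.compile(
    r"""^\s*
        (?P<street>.+?)\s+         # street name, may contain spaces
        (?P<number>\d+\w?)\s*,\s*  # house number with optional letter suffix
        (?P<postcode>\d{4,6})\s+   # postal code, assumed to be 4 to 6 digits
        (?P<city>.+?)\s*$          # city name
    """,
    re.VERBOSE,
)


def parse_regular_address(text):
    """Return [street, number, postcode, city], or None if the layout does not match."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    return [match["street"], match["number"], match["postcode"], match["city"]]


print(parse_regular_address("Hauptstrasse 12, 10115 Berlin"))
# A free-form string such as "120 w 45 st nyc" does not fit the layout and yields None.
print(parse_regular_address("120 w 45 st nyc"))
```

Anything that does not fit the assumed layout comes back as None rather than as a wrong split, which is usually what you want from a strict pattern.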

What is Libpostal?

libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere.


The Google Maps API works pretty well for this. E.g., suppose you are given the string "120 w 45 st nyc". Pass it to the Google Maps API like so: http://maps.google.com/maps/geo?q=120+w+45+st+nyc and you get this response:

{
  "name": "120 w 45 st nyc",
  "Status": {
    "code": 200,
    "request": "geocode"
  },
  "Placemark": [ {
    "id": "p1",
    "address": "120 W 45th St, New York, NY 10036, USA",
    "AddressDetails": {
      "Country": {
        "CountryNameCode": "US",
        "CountryName": "USA",
        "AdministrativeArea": {
          "AdministrativeAreaName": "NY",
          "Locality": {
            "LocalityName": "New York",
            "Thoroughfare": {
              "ThoroughfareName": "120 W 45th St"
            },
            "PostalCode": {
              "PostalCodeNumber": "10036"
            }
          }
        }
      },
      "Accuracy": 8
    },
    "ExtendedData": {
      "LatLonBox": {
        "north": 40.7603883,
        "south": 40.7540931,
        "east": -73.9807141,
        "west": -73.9870093
      }
    },
    "Point": {
      "coordinates": [ -73.9838617, 40.7572407, 0 ]
    }
  } ]
}
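
Pulling the parsed components out of a response like that could look roughly like the sketch below. It works on a trimmed copy of the response above and assumes the nesting is always Country > AdministrativeArea > Locality, which may not hold for every address; the function name extract_components and the output keys are my own.

```python
import json

# Trimmed copy of the response shown above (LatLonBox omitted).
RESPONSE = """
{
  "name": "120 w 45 st nyc",
  "Status": {"code": 200, "request": "geocode"},
  "Placemark": [{
    "id": "p1",
    "address": "120 W 45th St, New York, NY 10036, USA",
    "AddressDetails": {
      "Country": {
        "CountryNameCode": "US",
        "CountryName": "USA",
        "AdministrativeArea": {
          "AdministrativeAreaName": "NY",
          "Locality": {
            "LocalityName": "New York",
            "Thoroughfare": {"ThoroughfareName": "120 W 45th St"},
            "PostalCode": {"PostalCodeNumber": "10036"}
          }
        }
      },
      "Accuracy": 8
    },
    "Point": {"coordinates": [-73.9838617, 40.7572407, 0]}
  }]
}
"""


def extract_components(response_text):
    """Flatten the first Placemark into a simple dict, or return None on failure."""
    data = json.loads(response_text)
    if data["Status"]["code"] != 200 or not data.get("Placemark"):
        return None
    placemark = data["Placemark"][0]
    # Assumption: the nesting matches the sample response exactly.
    country = placemark["AddressDetails"]["Country"]
    area = country["AdministrativeArea"]
    locality = area["Locality"]
    # Coordinates are listed as longitude, latitude, altitude in the sample.
    lon, lat, _ = placemark["Point"]["coordinates"]
    return {
        "street": locality["Thoroughfare"]["ThoroughfareName"],
        "city": locality["LocalityName"],
        "state": area["AdministrativeAreaName"],
        "postcode": locality["PostalCode"]["PostalCodeNumber"],
        "country": country["CountryNameCode"],
        "lat": lat,
        "lon": lon,
    }


print(extract_components(RESPONSE))
```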

If you are looking for a simple address parser, try this:

http://usaddress.codeplex.com/

Good: 1. No database required 2. No internet lookup required 3. Pretty accurate

Bad: 1. Cannot confirm whether it is a real address 2. Only works for US addresses 3. Written in C#, requires .NET 3.5 or above


As has been mentioned, this is not a trivial problem. One of the biggest issues, apart from international addresses, is that there is no standard format for addresses, and an address can't tell you whether it's well-formed; i.e., it isn't self-validating the way a credit card number is.

Because of this, you have to rely on an external source of truth to ensure the address is real. This is where an address verification service comes into the mix. Depending upon your business needs and application requirements, you may be looking at a one-time "batch" scrub of your address list, or perhaps a realtime/live address validation service. There are a number of good providers (which vary in cost) that can easily solve this problem.

I should mention that I'm the founder of SmartyStreets. We do CASS-certified address verification. We'll take your unformatted/raw addresses and turn them into addresses which have been cleaned, standardized, and verified/confirmed. Depending on the size of your list, the cost is usually only a few dollars and the turnaround time is nearly instant--usually a few minutes.


You could try Experian Address Verification. It has its issues, but it pretty much works as advertised.


As @duffymo said, there is no trivial solution, so the next best thing might be to reconsider the design. If it's a user form, make a compromise and let the user fill in the address fields. If you are retroactively parsing data, use a very strict regex to parse the addresses that meet some criteria (e.g., the country is US). Then make a second pass over the ones that are left, and so on. I have taken this approach, and it's the only one I've found reliable.
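
A rough sketch of that multi-pass idea follows. The two patterns and the name parse_in_passes are hypothetical and far simpler than anything you would use on real data; the point is only the structure: strictest pattern first, then looser ones on whatever is left.

```python
import re

# Hypothetical passes, strictest first. Real patterns would need to be much
# more thorough; these only handle "NUMBER STREET, CITY, ST [ZIP]".
PASSES = [
    ("us_full", re.compile(
        r"^(?P<number>\d+) (?P<street>[A-Za-z0-9 .]+), (?P<city>[A-Za-z .]+), "
        r"(?P<state>[A-Z]{2}) (?P<zip>\d{5}(?:-\d{4})?)$")),
    ("us_no_zip", re.compile(
        r"^(?P<number>\d+) (?P<street>[A-Za-z0-9 .]+), (?P<city>[A-Za-z .]+), "
        r"(?P<state>[A-Z]{2})$")),
]


def parse_in_passes(addresses):
    """Run each pattern over the addresses the previous pass could not parse.

    Returns (parsed, remaining): parsed maps each matched address to its
    components plus the name of the pass that matched it; remaining lists
    the addresses that no pass could handle.
    """
    parsed = {}
    remaining = list(addresses)
    for name, pattern in PASSES:
        leftover = []
        for address in remaining:
            match = pattern.match(address.strip())
            if match:
                parsed[address] = {"pass": name, **match.groupdict()}
            else:
                leftover.append(address)
        remaining = leftover
    return parsed, remaining


parsed, remaining = parse_in_passes([
    "120 W 45th St, New York, NY 10036",
    "1 Main St, Springfield, IL",
    "120 w 45 st nyc",
])
print(parsed)
print(remaining)  # what is left over for a later pass or manual review
```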

Another design problem with taking a generic regex approach is that it will generate false positives for bad addresses. If you are sending snail mail to these people, it will end up bouncing, and you'll have more work on your hands trying to sort out which ones came back, or you'll keep sending mail to erroneous addresses.


I tried RecogniContact recently. It is a Windows COM component that parses US and European addresses. You can test it from the website.

http://www.loquisoft.com/index.php?page=8