Parsing street addresses in Ruby

I am parsing street addresses into separate fields for the database. I can extract the house number and the street type, but I am trying to determine the best way to get the street name, i.e. the address without the number and the last word. A typical street address received would be:

    res[:address] = '7707 Foo Bar Blvd'

As of now I can parse the following:

    house = res[:address].gsub(/\D/, '')
    street_type = res[:address].split(/\s+/).last

My first challenge is how to get 'Foo Bar'. Note that the street name could be one, two, or three words. I am struggling to find a one-line expression for this in Ruby.

My second question is how to improve the 'house' code to handle house numbers that end in a letter, for example "7707B".

Lastly, a reference to a good cheat sheet with examples for these expressions would be helpful.

Stuart C asked Apr 21 '13 18:04


3 Answers

I'd recommend using a library for this if possible, since address parsing can be difficult. Check out the Indirizzo Ruby gem, which makes this easy:

require 'Indirizzo'
address = Indirizzo::Address.new("7707 Foo Bar Blvd")
address.number
 => "7707"
address.street
 => ["foo bar blvd", "foo bar boulevard"] 

Even if you don't use the Indirizzo library itself, reading through its source code can be useful for seeing how it solves the problem. For instance, it has finely tuned regular expressions to match different parts of an address:

Match = {
  # FIXME: shouldn't have to anchor :number and :zip at start/end
  :number   => /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/io,
  :street   => /(?:\b(?:\d+\w*|[a-z'-]+)\s*)+/io,
  :city     => /(?:\b[a-z][a-z'-]+\s*)+/io,
  :state    => State.regexp,
  :zip      => /\b(\d{5})(?:-(\d{4}))?\b/o,
  :at       => /\s(at|@|and|&)\s/io,
  :po_box => /\b[P|p]*(OST|ost)*\.*\s*[O|o|0]*(ffice|FFICE)*\.*\s*[B|b][O|o|0][X|x]\b/
}

These files from its source code can give more specifics:

  • https://github.com/daveworth/Indirizzo/blob/master/lib/indirizzo/address.rb
  • https://github.com/daveworth/Indirizzo/blob/master/lib/indirizzo/constants.rb
  • https://github.com/daveworth/Indirizzo/blob/master/lib/indirizzo/numbers.rb

(But I would also generally agree with @drhenner's comment that, in order to make this easier on yourself, you could probably just accept these data inputs in separate fields.)

Edit: To give a more specific answer about how to remove the street suffix (e.g., "Blvd"), you could use Indirizzo's regular expression constants (such as Suffix_Type from constants.rb) like so:

address = Indirizzo::Address.new("7707 Foo Bar Blvd", :expand_streets => false)
address.street.map {|street| street.gsub(Indirizzo::Suffix_Type.regexp, '').strip }
 => ["foo bar"]

(Notice I also passed :expand_streets => false to the initializer, to avoid having both "Blvd" and "Boulevard" alternatives expanded, since we're discarding the suffix anyway.)

Stuart M answered Nov 11 '22 14:11


You can play fast and loose with named capture groups in a regex:

matches = res[:address].match(/^(?<number>\S*)\s+(?<name>.*)\s+(?<type>.*)$/)
house = matches[:number]
street = matches[:name]
street_type = matches[:type]

Or, if you want the regex to be stricter about the type, you could replace (?<type>.*) with (?<type>(Blvd|Ave|Rd|St)) and add whatever other suffixes you need.
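As a rough sketch of that stricter variant: the suffix list below is a deliberately short, hypothetical one (STREET_TYPES, ADDRESS_RE and parse_address are illustrative names, not part of any library), and it assumes the house number is a run of digits with an optional trailing letter such as "7707B".

```ruby
# Hypothetical, deliberately short suffix list; extend it for real data.
STREET_TYPES = %w[Blvd Ave Rd St].freeze

# number: digits plus an optional trailing letter (e.g. "7707B")
# name:   everything between the number and the suffix (lazy match)
# type:   one of the known suffixes, matched case-insensitively
ADDRESS_RE = /\A(?<number>\d+[a-z]?)\s+(?<name>.+?)\s+(?<type>#{Regexp.union(STREET_TYPES).source})\z/i

# Returns a hash of the three fields, or nil when the address does not
# fit the pattern (for example, an unknown suffix).
def parse_address(address)
  m = ADDRESS_RE.match(address.strip)
  return nil unless m

  { house: m[:number], street: m[:name], street_type: m[:type] }
end

parse_address('7707 Foo Bar Blvd')
# => {house: "7707", street: "Foo Bar", street_type: "Blvd"}
parse_address('7707B Foo Ave')
# => {house: "7707B", street: "Foo", street_type: "Ave"}
```

Returning nil for anything that does not match makes it easy to flag addresses for manual review instead of storing a bad guess.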

airazor answered Nov 11 '22 14:11


You could perhaps use something like:

^\S+ (.+?) \S+$

\S matches any non-whitespace character, so \S+ matches a whole word.

^ matches the beginning of the string.

$ matches the end of the string.

(.+?) lazily captures everything between the first word and the last word.
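In Ruby this pattern can be applied as a one-liner with String#[], which takes a regex and a capture-group index. A minimal sketch, assuming the address has at least three space-separated words (the variable names are illustrative):

```ruby
address = '7707 Foo Bar Blvd'

# String#[] with a regex and a group index returns that capture,
# or nil when the string does not match at all.
street = address[/^\S+ (.+?) \S+$/, 1]
# => "Foo Bar"
```

An address with only two words, such as "7707 Blvd", does not match and yields nil.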

Jerry answered Nov 11 '22 15:11