I am processing addresses into their respective fields for the database. I can extract the house number and the street type, but I am trying to determine the best way to get the street name without the number and the last word. A standard street address received would be:
res[:address] = '7707 Foo Bar Blvd'
As of now I can parse the following:
house = res[:address].gsub(/\D/, '')
street_type = res[:address].split(/\s+/).last
My first challenge is how to get 'Foo Bar'. Note that the street name could be one, two, or three words. I am struggling to find a one-line expression for this in Ruby.
My second question is how to improve the 'house' code to handle house numbers that end in a letter, for example "7707B".
Lastly, if you can point me to a good cheat sheet with examples for these expressions, that would be helpful.
I'd recommend using a library for this if possible, since address parsing can be difficult. Check out the Indirizzo Ruby gem, which makes this easy:
require 'Indirizzo'
address = Indirizzo::Address.new("7707 Foo Bar Blvd")
address.number
=> "7707"
address.street
=> ["foo bar blvd", "foo bar boulevard"]
Even if you don't use the Indirizzo library itself, reading through its source code is probably very useful to see how they solved the problem. For instance, it has finely-tuned regular expressions to match different parts of an address:
Match = {
# FIXME: shouldn't have to anchor :number and :zip at start/end
:number => /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/io,
:street => /(?:\b(?:\d+\w*|[a-z'-]+)\s*)+/io,
:city => /(?:\b[a-z][a-z'-]+\s*)+/io,
:state => State.regexp,
:zip => /\b(\d{5})(?:-(\d{4}))?\b/o,
:at => /\s(at|@|and|&)\s/io,
:po_box => /\b[P|p]*(OST|ost)*\.*\s*[O|o|0]*(ffice|FFICE)*\.*\s*[B|b][O|o|0][X|x]\b/
}
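To make the :number pattern above a bit more concrete, here is a minimal sketch in plain Ruby (no gem required) that applies the same expression to a house number with a trailing letter. The constant name NUMBER_RE and the helper split_house_number are hypothetical names of my own, not part of Indirizzo, and I have dropped the /o flag since nothing is interpolated.

```ruby
# Pattern copied from the :number entry quoted above (without the /o flag).
# Group 2 is the numeric part, group 3 is an optional single-letter suffix.
NUMBER_RE = /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/i

# Hypothetical helper: returns [digits, letter_suffix], or nil when the
# string does not start with something that looks like a house number.
def split_house_number(address)
  m = NUMBER_RE.match(address)
  return nil unless m

  [m[2], m[3]]
end

split_house_number("7707B Foo Bar Blvd") # => ["7707", "B"]
split_house_number("7707 Foo Bar Blvd")  # => ["7707", ""]
split_house_number("Foo Bar Blvd")       # => nil
```

This only covers the second question (the "7707B" case); it says nothing about how the library itself combines the pieces.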
These files from its source code can give more specifics:
(But I would also generally agree with @drhenner's comment that, in order to make this easier on yourself, you could probably just accept these data inputs in separate fields.)
Edit: To give a more specific answer about how to remove the street suffix (e.g., "Blvd"), you could use Indirizzo's regular expression constants (such as Suffix_Type from constants.rb) like so:
address = Indirizzo::Address.new("7707 Foo Bar Blvd", :expand_streets => false)
address.street.map {|street| street.gsub(Indirizzo::Suffix_Type.regexp, '').strip }
=> ["foo bar"]
(Notice I also passed :expand_streets => false to the initializer, to avoid having both "Blvd" and "Boulevard" alternatives expanded, since we're discarding the suffix anyway.)
You can play fast and loose with named capture groups in a regex:
matches = res[:address].match(/^(?<number>\S*)\s+(?<name>.*)\s+(?<type>.*)$/)
house = matches[:number]
street = matches[:name]
street_type = matches[:type]
Or, if you wanted your regex to be a little more accurate with the type, you could replace (?<type>.*) with (?<type>(Blvd|Ave|Rd|St)) and add all the different options you'd want.
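Putting the named-capture idea together with a fixed list of street types, a rough sketch might look like the following. The STREET_TYPES list is only an illustrative, incomplete assumption, and ADDRESS_RE and parse_address are hypothetical names. The number group also accepts one trailing letter, so "7707B" stays intact.

```ruby
# Illustrative, incomplete list of suffixes; extend it for real data.
STREET_TYPES = %w[Blvd Ave Rd St Dr Ln Ct].freeze

# number: digits plus an optional single letter (e.g. "7707B")
# name:   lazily matched, so the final word is left for the type group
# type:   one of STREET_TYPES, optionally followed by a period
ADDRESS_RE = /\A(?<number>\d+[A-Za-z]?)\s+(?<name>.+?)\s+(?<type>#{Regexp.union(STREET_TYPES).source})\.?\z/i

# Hypothetical helper: returns a hash of the three parts, or nil when the
# address does not fit the expected "number name type" shape.
def parse_address(address)
  m = ADDRESS_RE.match(address.strip)
  return nil unless m

  { house: m[:number], street: m[:name], street_type: m[:type] }
end

parse_address("7707 Foo Bar Blvd")
# => {:house=>"7707", :street=>"Foo Bar", :street_type=>"Blvd"}
parse_address("7707B Main St")
# => {:house=>"7707B", :street=>"Main", :street_type=>"St"}
```

Addresses whose last word is not in the list return nil, which is stricter than the .* version but makes bad input easier to spot.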
You could perhaps use something like:
^\S+ (.+?) \S+$
\S matches any non-whitespace character.
^ matches the beginning of the string.
$ matches the end of the string.
And (.+?) captures anything in between the two.
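In Ruby, that pattern can be used as a one-liner through String#[] with a capture index. This is just a sketch, and the method name street_name is a hypothetical wrapper added so the expression is easy to try out:

```ruby
# Returns the text between the first and last whitespace-separated words,
# or nil when the address has fewer than three words.
def street_name(address)
  address[/^\S+ (.+?) \S+$/, 1]
end

street_name("7707 Foo Bar Blvd")            # => "Foo Bar"
street_name("7707 Main St")                 # => "Main"
street_name("7707 Martin Luther King Blvd") # => "Martin Luther King"
street_name("7707 Blvd")                    # => nil
```

Note that the pattern expects single spaces around the name; if the input may contain runs of whitespace, swapping the literal spaces for \s+ would be one way to loosen it.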