Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression for parsing mailing addresses

Tags:

c#

regex

I have an address class that uses a regular expression to parse the house number, street name, and street type from the first line of an address. This code is generally working well, but I'm posting here to share with the community and to see if anyone has suggestions for improvement.

Note: The STREETTYPES and QUADRANT constants contain all of the relevant street types and quadrants respectively.

I've included a subset here:

private const string STREETTYPES = @"ALLEY|ALY|ANNEX|AX|ARCADE|ARC|AVENUE|AV|AVE|BAYOU|BYU|BEACH|...";

private const string QUADRANTS = "N|NORTH|S|SOUTH|E|EAST|W|WEST|NE|NORTHEAST|NW|NORTHWEST|SE|SOUTHEAST|SW|SOUTHWEST";

HouseNumber, Quadrant, StreetName, and StreetType are all properties on the class.

    private void Parse(string line1)
    {
        HouseNumber = string.Empty;
        Quadrant = string.Empty;
        StreetName = string.Empty;
        StreetType = string.Empty;

        if (!String.IsNullOrEmpty(line1))
        {
            string noPeriodsLine1 = String.Copy(line1);
            noPeriodsLine1 = noPeriodsLine1.Replace(".", "");

            string addressParseRegEx =
                @"(?ix)
            ^
            \s*
            (?:
               (?<housenumber>\d+)
               (?:(?:\s+|-)(?<quadrant>" +
                QUADRANTS +
                @"))?
               (?:(?:\s+|-)(?<streetname>\S+(?:\s+\S+)*?))??
               (?:(?:\s+|-)(?<quadrant>" +
                QUADRANTS + @"))?
               (?:(?:\s+|-)(?<streettype>" + STREETTYPES +
                @"))?
               (?:(?:\s+|-)(?<streettypequalifier>(?!(?:" +
                QUADRANTS +
                @"))(?:\d+|\S+)))?
               (?:(?:\s+|-)(?<streettypequadrant>(" +
                QUADRANTS + @")))??
               (?:(?:\s+|-)(?<suffix>(?:ste|suite|po\sbox|apt)\s*\S*))?
            |
               (?:(?:po|postoffice|post\s+office)\s+box\s+(?<postofficebox>\S+))
            )
            \s*
            $
            ";
            Match match = Regex.Match(noPeriodsLine1, addressParseRegEx);
            if (match.Success)
            {
                HouseNumber = match.Groups["housenumber"].Value;
                Quadrant = (string.IsNullOrEmpty(match.Groups["quadrant"].Value)) ? match.Groups["streettypequadrant"].Value : match.Groups["quadrant"].Value;
                if (match.Groups["streetname"].Captures.Count > 1)
                {
                    foreach (Capture capture in match.Groups["streetname"].Captures)
                    {
                        StreetName += capture.Value + " ";
                    }
                    StreetName = StreetName.Trim();
                }
                else
                {
                    StreetName = (string.IsNullOrEmpty(match.Groups["streetname"].Value)) ? match.Groups["streettypequalifier"].Value : match.Groups["streetname"].Value;
                }
                StreetType = match.Groups["streettype"].Value;

                //if the matched street type is found
                //use the abbreviated version...especially for credit bureau calls
                string streetTypeAbbreviation;
                if (StreetTypes.TryGetValue(StreetType.ToUpper(), out streetTypeAbbreviation))
                {
                    StreetType = streetTypeAbbreviation;
                }
            }
        }

    }
like image 231
Matt Ruwe Avatar asked Mar 13 '09 13:03

Matt Ruwe


People also ask

How do you parse an address?

The easiest way to parse an address is by applying a Regex. This method really proves itself when you have regular form addresses. For example, if all the address strings are like STREET_NAME XX, YYYYYY CITY_NAME, you can select a regexp that will split the strings to [STREET_NAME, XX, YYYYYY, CITY_NAME].

What is regular expression parsing?

Regular expression parsing makes finding matches of a regular expression even more useful by allowing us to directly extract subpatterns of the match, e.g., for extracting IP-addresses from internet traffic analysis or extracting subparts of genomes from genetic data bases.


2 Answers

Have fun with addresses and regexs, you're in for a long, horrible ride.

You're trying to lay order upon chaos.

For every "123 Simple Way", there's a "14 1/2 South".

Then, for extra laughs, there's Salt Lake City: "855 South 1300 East".

Have fun with that.

There are more exceptions than rules when it comes to street adresses.

like image 69
Will Hartung Avatar answered Oct 18 '22 11:10

Will Hartung


I don't know what country you're in, but if you're in the USA and want to spend some money on address validation, you can buy related USPS products here. And here is a good place to find free word lists from the USPS for expected words and abbreviations. I'm sure similar pages are available for other countries.

like image 39
Adrian Archer Avatar answered Oct 18 '22 12:10

Adrian Archer