Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to parse string of email addresses

Tags:

c#

.net

parsing

So i am working with some email header data, and for the to:, from:, cc:, and bcc: fields the email address(es) can be expressed in a number of different ways:

First Last <[email protected]>
Last, First <[email protected]>
[email protected]

And these variations can appear in the same message, in any order, all in one comma separated string:

First, Last <[email protected]>, [email protected], First Last <[email protected]>

I've been trying to come up with a way to parse this string into separate First Name, Last Name, E-Mail for each person (omitting the name if only an email address is provided).

Can someone suggest the best way to do this?

I've tried to Split on the commas, which would work except in the second example where the last name is placed first. I suppose this method could work, if after i split, i examine each element and see if it contains a '@' or '<'/'>', if it doesn't then it could be assumed that the next element is the first name. Is this a good way to approach this? Have i overlooked another format the address could be in?


UPDATE: Perhaps i should clarify a little, basically all i am looking to do is break up the string containing the multiple addresses into individual strings containing the address in whatever format it was sent in. I have my own methods for validating and extracting the information from an address, it was just tricky for me to figure out the best way to separate each address.

Here is the solution i came up with to accomplish this:

String str = "Last, First <[email protected]>, [email protected], First Last <[email protected]>, \"First Last\" <[email protected]>";

List<string> addresses = new List<string>();
int atIdx = 0;
int commaIdx = 0;
int lastComma = 0;
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;

    if (str[c] == ',')
        commaIdx = c;

    if (commaIdx > atIdx && atIdx > 0)
    {
        string temp = str.Substring(lastComma, commaIdx - lastComma);
        addresses.Add(temp);
        lastComma = commaIdx;
        atIdx = commaIdx;
    }

    if (c == str.Length -1)
    {
        string temp = str.Substring(lastComma, str.Legth - lastComma);
        addresses.Add(temp);
    }
}

if (commaIdx < 2)
{
    // if we get here we can assume either there was no comma, or there was only one comma as part of the last, first combo
    addresses.Add(str);
}

The above code generates the individual addresses that i can process further down the line.

like image 595
Jason Miesionczek Avatar asked Jan 16 '09 18:01

Jason Miesionczek


People also ask

How email parsing works?

An email parser works by using information and criteria that you supply to it. You tell it what keywords or terms to search for. Then, it looks for and parses data fields from places like email bodies, sender details, email footers, etc. It does this on your rules and terms.


7 Answers

The clean and short solution is to use MailAddressCollection:

var collection = new MailAddressCollection();
collection.Add(addresses);

This approach parses a list of addresses separated with colon ,, and validates it according to RFC. It throws FormatException in case the addresses are invalid. As suggested in other posts, if you need to deal with invalid addresses, you have to pre-process or parse the value by yourself, otherwise recommending to use what .NET offers without using reflection.

Sample:

var collection = new MailAddressCollection();
collection.Add("Joe Doe <[email protected]>, [email protected]");

foreach (var addr in collection)
{
  // addr.DisplayName, addr.User, addr.Host
}
like image 70
Jakub Míšek Avatar answered Oct 03 '22 01:10

Jakub Míšek


There is internal System.Net.Mail.MailAddressParser class which has method ParseMultipleAddresses which does exactly what you want. You can access it directly through reflection or by calling MailMessage.To.Add method, which accepts email list string.

private static IEnumerable<MailAddress> ParseAddress(string addresses)
{
    var mailAddressParserClass = Type.GetType("System.Net.Mail.MailAddressParser");
    var parseMultipleAddressesMethod = mailAddressParserClass.GetMethod("ParseMultipleAddresses", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
    return (IList<MailAddress>)parseMultipleAddressesMethod.Invoke(null, new object[0]);
}


    private static IEnumerable<MailAddress> ParseAddress(string addresses)
    {
        MailMessage message = new MailMessage();
        message.To.Add(addresses);
        return new List<MailAddress>(message.To); //new List, because we don't want to hold reference on Disposable object
    }
like image 33
user7984361 Avatar answered Oct 03 '22 01:10

user7984361


There isn't really an easy solution to this. I would recommend making a little state machine that reads char-by-char and do the work that way. Like you said, splitting by comma won't always work.

A state machine will allow you to cover all possibilities. I'm sure there are many others you haven't seen yet. For example: "First Last"

Look for the RFC about this to discover what all the possibilities are. Sorry, I don't know the number. There are probably multiple as this is the kind of things that evolves.

like image 35
Sarel Botha Avatar answered Oct 03 '22 01:10

Sarel Botha


At the risk of creating two problems, you could create a regular expression that matches any of your email formats. Use "|" to separate the formats within this one regex. Then you can run it over your input string and pull out all of the matches.

public class Address
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public Address(string first, string last, string name, string domain)
    {
        _first = first;
        _last = last;
        _name = name;
        _domain = domain;
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }
}

[TestFixture]
public class RegexEmailTest
{
    [Test]
    public void TestThreeEmailAddresses()
    {
        Regex emailAddress = new Regex(
            @"((?<last>\w*), (?<first>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<first>\w*) (?<last>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<name>\w*)@(?<domain>\w*\.\w*))");
        string input = "First, Last <[email protected]>, [email protected], First Last <[email protected]>";

        MatchCollection matches = emailAddress.Matches(input);
        List<Address> addresses =
            (from Match match in matches
             select new Address(
                 match.Groups["first"].Value,
                 match.Groups["last"].Value,
                 match.Groups["name"].Value,
                 match.Groups["domain"].Value)).ToList();
        Assert.AreEqual(3, addresses.Count);

        Assert.AreEqual("Last", addresses[0].First);
        Assert.AreEqual("First", addresses[0].Last);
        Assert.AreEqual("name", addresses[0].Name);
        Assert.AreEqual("domain.com", addresses[0].Domain);

        Assert.AreEqual("", addresses[1].First);
        Assert.AreEqual("", addresses[1].Last);
        Assert.AreEqual("name", addresses[1].Name);
        Assert.AreEqual("domain.com", addresses[1].Domain);

        Assert.AreEqual("First", addresses[2].First);
        Assert.AreEqual("Last", addresses[2].Last);
        Assert.AreEqual("name", addresses[2].Name);
        Assert.AreEqual("domain.com", addresses[2].Domain);
    }
}

There are several down sides to this approach. One is that it doesn't validate the string. If you have any characters in the string that don't fit one of your chosen formats, then those characters are just ignored. Another is that the accepted formats are all expressed in one place. You cannot add new formats without changing the monolithic regex.

like image 40
Michael L Perry Avatar answered Oct 03 '22 00:10

Michael L Perry


Your 2nd email example is not a valid address as it contains a comma which is not within a quoted string. To be valid it should be like: "Last, First"<[email protected]>.

As for parsing, if you want something that is quite strict, you could use System.Net.Mail.MailAddressCollection.

If you just want to your input split into separate email strings, then the following code should work. It is not very strict but will handle commas within quoted strings and throw an exception if the input contains an unclosed quote.

public List<string> SplitAddresses(string addresses)
{
    var result = new List<string>();

    var startIndex = 0;
    var currentIndex = 0;
    var inQuotedString = false;

    while (currentIndex < addresses.Length)
    {
        if (addresses[currentIndex] == QUOTE)
        {
            inQuotedString = !inQuotedString;
        }
        // Split if a comma is found, unless inside a quoted string
        else if (addresses[currentIndex] == COMMA && !inQuotedString)
        {
            var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
            if (address.Length > 0)
            {
                result.Add(address);
            }
            startIndex = currentIndex + 1;
        }
        currentIndex++;
    }

    if (currentIndex > startIndex)
    {
        var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
        if (address.Length > 0)
        {
            result.Add(address);
        }
    }

    if (inQuotedString)
        throw new FormatException("Unclosed quote in email addresses");

    return result;
}

private string GetAndCleanSubstring(string addresses, int startIndex, int currentIndex)
{
    var address = addresses.Substring(startIndex, currentIndex - startIndex);
    address = address.Trim();
    return address;
}
like image 35
Phil Haselden Avatar answered Oct 02 '22 23:10

Phil Haselden


There is no generic simple solution to this. The RFC you want is RFC2822, which describes all of the possible configurations of an email address. The best you are going to get that will be correct is to implement a state-based tokenizer that follows the rules specified in the RFC.

like image 40
Scott Dorman Avatar answered Oct 02 '22 23:10

Scott Dorman


Here is the solution i came up with to accomplish this:

String str = "Last, First <[email protected]>, [email protected], First Last <[email protected]>, \"First Last\" <[email protected]>";

List<string> addresses = new List<string>();
int atIdx = 0;
int commaIdx = 0;
int lastComma = 0;
for (int c = 0; c < str.Length; c++)
{
if (str[c] == '@')
    atIdx = c;

if (str[c] == ',')
    commaIdx = c;

if (commaIdx > atIdx && atIdx > 0)
{
    string temp = str.Substring(lastComma, commaIdx - lastComma);
    addresses.Add(temp);
    lastComma = commaIdx;
    atIdx = commaIdx;
}

if (c == str.Length -1)
{
    string temp = str.Substring(lastComma, str.Legth - lastComma);
    addresses.Add(temp);
}
}

if (commaIdx < 2)
{
    // if we get here we can assume either there was no comma, or there was only one comma as part of the last, first combo
    addresses.Add(str);
}
like image 33
Jason Miesionczek Avatar answered Oct 03 '22 01:10

Jason Miesionczek