I'm trying to find a flexible way to parse email content. Below is an example of the dummy email text I'm working with. I'd like to avoid regular expressions if at all possible, although at this point in my problem-solving process I'm beginning to think they're inevitable. Note that this is only a small dummy subset of a full email. What I need is to parse every field (e.g. Ticket No, Cell Phone) into its respective data type. Lastly, some fields are not guaranteed to be present in the email (you'll see in my current solution below why this is a problem).
Header Code:EMERGENCY
Ticket No: 123456789 Seq. No: 2
Update of:
Original Call Date: 01/02/2011 Time: 11:17:03 AM OP: 1102
Second Call Date: 01/02/2011 Time: 12:11:00 AM OP:
Company: COMPANY NAME
Contact: CONTACT NAME Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact: Altern. Phone:
Best Time to Call: AFTER 4:30P Fax No: (111)111-1111
Cell Phone: Pager No:
Caller Address: 330 FOO
FOO AVENUE 123
For this simple example I'm successfully able to parse most fields with the function below.
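private T BetweenOperation<T>(string emailBody, string start, string end)
{
    var comparison = StringComparison.InvariantCulture;

    // Locate the start key; give up if it is missing.
    int keyIndex = emailBody.IndexOf(start, comparison);
    if (keyIndex < 0) return default(T);
    int startIndex = keyIndex + start.Length;

    // Look for the end key only after the start key.
    int endIndex = emailBody.IndexOf(end, startIndex, comparison);
    if (endIndex < 0) return default(T);

    // Convert the trimmed text between the two keys to T.
    return (T)Convert.ChangeType(
        emailBody.Substring(startIndex, endIndex - startIndex).Trim(),
        typeof(T));
}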
Essentially, my idea was that I could parse the content between two fields. For example, I could get the header code by doing
// returns "EMERGENCY"
BetweenOperation<string>("email content", "Header Code:", "Ticket No:");
This approach, however, has many flaws. One big flaw is that the end field is not always present. As you can see, there are also keys that share keywords and don't parse quite right, such as "Contact" and "Secondary Contact". This causes the parser to fetch too much information. Also, if my end field is not present, I'll get some unpredictable result. Lastly, I can extract an entire line and then pass it to BetweenOperation<T> using this.
private string LineOperation(string startWithCriteria)
{
    string[] emailLines = EmailBody.Split(new[] { '\n' });
    // First line that begins with the given key, or null if none does.
    return emailLines.FirstOrDefault(
        emailLine => emailLine.StartsWith(startWithCriteria));
}
We would use LineOperation in some cases where the field name is not unique (e.g. Time) and feed the result to BetweenOperation<T>.
How can I parse the content shown above based on keys? Keys being "Header Code" and "Cell Phone", for example. Note that I don't think parsing based on spaces or tabs will work, because some of the fields can be several lines long (e.g. Caller Address) or contain no value at all (e.g. Altern. Phone).
Thank you.
In my opinion, you should search for the fields in a specific sequence and, after each match, modify your email body accordingly.
Specific sequence
Contact: CONTACT NAME Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact:
The sequence in which you search for your fields should start with keywords that are not subsets of any other keyword in your "Fields" (e.g. for contacts, the sequence should be "Secondary Contact:", "Alternate Contact:" and lastly "Contact:").
Modify your email body
If you find the field information that you require, you will need to modify the email body to remove it. Parsing in a specific sequence will ensure (I hope) that you won't have the mismatch issue, since the keywords that are subsets of others are removed last.
Now there is also the issue of the end keyword field. Since the end field is not always guaranteed to be there (and I am unsure whether the fields will always be in a specific order), you would have to loop through all your keyword fields, find the index of each, and treat the closest one that follows the current field as its end.
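The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name SequenceParser and the method Parse are made up for this example, each key is assumed to occur at most once, and a found key is blanked out in a working copy rather than physically removed, so that the indexes still line up with the original text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original question.
public static class SequenceParser
{
    // Returns field values by key; keys that are missing from the text are omitted.
    public static Dictionary<string, string> Parse(string body, IEnumerable<string> keys)
    {
        var working = body.ToCharArray();   // copy that gets blanked out as keys are found
        var found = new List<(int Index, string Key)>();

        // Longest keys first, so "Secondary Contact:" is claimed before "Contact:".
        foreach (var key in keys.OrderByDescending(k => k.Length))
        {
            int index = new string(working).IndexOf(key, StringComparison.Ordinal);
            if (index < 0) continue;        // optional field, not in this email
            found.Add((index, key));
            for (int i = index; i < index + key.Length; i++)
                working[i] = '\0';          // mask the key so shorter keys cannot match inside it
        }

        // Each value runs from the end of its key to the nearest following key.
        found.Sort((a, b) => a.Index.CompareTo(b.Index));
        var result = new Dictionary<string, string>();
        for (int i = 0; i < found.Count; i++)
        {
            int start = found[i].Index + found[i].Key.Length;
            int end = i + 1 < found.Count ? found[i + 1].Index : body.Length;
            result[found[i].Key] = body.Substring(start, end - start).Trim();
        }
        return result;
    }
}
```

Calling SequenceParser.Parse with the contact lines from the question and the keys "Contact:", "Contact Phone:", "Secondary Contact:" and "Alternate Contact:" should give "CONTACT NAME" for "Contact:" without swallowing the secondary contact. Repeated keys such as "Time:" would need extra handling.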
One way to approach the problem would be to first search the entire text for occurrences of your keys. That is, build an array that looks like:
"Header Code:",1
"Contact Phone:",233
"Cell Phone:",-1 // not there
If you sort that array by position, then you know where to look for things. That is, you'll know which field follows which.
You'll have to do something with duplicates (e.g. the two "Time:" keys in the call dates). And you'll have to resolve "Contact:" and "Secondary Contact:", although that one should be pretty easy.
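A minimal sketch of this position-table idea might look like the following. The names KeyIndexParser and Parse are invented for the example, keys are assumed to be non-empty, and the way overlaps and duplicates are handled (drop a hit that lies inside a longer hit, return one entry per occurrence) is just one possible choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original question.
public static class KeyIndexParser
{
    // Returns (key, value) pairs in document order; a repeated key appears once per occurrence.
    public static List<KeyValuePair<string, string>> Parse(string body, IEnumerable<string> keys)
    {
        // Step 1: record every occurrence of every key.
        var hits = new List<(int Start, string Key)>();
        foreach (var key in keys)
        {
            for (int i = body.IndexOf(key, StringComparison.Ordinal);
                 i >= 0;
                 i = body.IndexOf(key, i + key.Length, StringComparison.Ordinal))
            {
                hits.Add((i, key));
            }
        }

        // Step 2: drop hits that sit inside a longer hit
        // ("Contact:" inside "Secondary Contact:"), then sort by position.
        var kept = hits
            .Where(h => !hits.Any(o => o.Key.Length > h.Key.Length
                                       && o.Start <= h.Start
                                       && h.Start + h.Key.Length <= o.Start + o.Key.Length))
            .OrderBy(h => h.Start)
            .ToList();

        // Step 3: each value is the text between one key and the next.
        var fields = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < kept.Count; i++)
        {
            int start = kept[i].Start + kept[i].Key.Length;
            int end = i + 1 < kept.Count ? kept[i + 1].Start : body.Length;
            fields.Add(new KeyValuePair<string, string>(
                kept[i].Key, body.Substring(start, end - start).Trim()));
        }
        return fields;
    }
}
```

Because the result keeps document order, the first "Time:" entry follows "Original Call Date:" and the second follows "Second Call Date:", which is one way to tell the duplicates apart.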
If you do this with standard string operations (i.e. IndexOf), it's going to be somewhat inefficient because you'll have to search the entire text for all occurrences of every string. Whether that's a problem for you is hard to say. It depends on how many of these you have to do.
If it becomes a problem, you'll probably want to build an Aho-Corasick string matcher, or something similar. Or you could build up a big ol' ugly regex:
"(Header Code:)|(Contact Phone:)|(Cell Phone:)"
... etc. Probably with named captures so you know what you're capturing. It should work reasonably well, although it might be difficult to maintain.
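As a rough illustration of the regex route, the pattern can be generated from the key list instead of being written by hand, which may ease the maintenance concern. Everything here is an assumption made for the example: the names RegexFieldParser and Parse, the use of two named groups (key and value) rather than one group per field, and a non-empty key list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question.
public static class RegexFieldParser
{
    // Returns (key, value) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string body, IEnumerable<string> keys)
    {
        // Escape each key and put longer keys first, so that where one key
        // is a prefix of another the longer alternative is tried first.
        string alternation = string.Join("|",
            keys.OrderByDescending(k => k.Length).Select(Regex.Escape));

        // key: one of the known labels.
        // value: lazily, everything up to the next label or the end of the text.
        var pattern = new Regex(
            $"(?<key>{alternation})(?<value>.*?)(?={alternation}|\\z)",
            RegexOptions.Singleline);

        return pattern.Matches(body)
            .Cast<Match>()
            .Select(m => new KeyValuePair<string, string>(
                m.Groups["key"].Value, m.Groups["value"].Value.Trim()))
            .ToList();
    }
}
```

RegexOptions.Singleline lets the value span line breaks, which matters for multi-line fields such as Caller Address. Fields that are absent from the email simply produce no match.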