
Flexible text parsing strategies

Problem

I'm trying to find a flexible way to parse email content. Below is an example of the dummy email text I'm working with. I'd like to avoid regular expressions if at all possible, but at this point in my problem-solving process I'm beginning to think they're inevitable. Note that this is only a small dummy subset of a full email. What I need is to parse every field (e.g. Ticket No, Cell Phone) into its respective data type. Lastly, some fields are not guaranteed to be present in the email (you'll see in my current solution below why this is a problem).

Header Code:EMERGENCY                               
Ticket No:   123456789 Seq. No: 2
Update of:             

Original Call Date:     01/02/2011     Time:      11:17:03 AM  OP: 1102
Second Call Date:     01/02/2011     Time:      12:11:00 AM  OP: 

Company:           COMPANY NAME
Contact:      CONTACT NAME          Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact:                       Altern. Phone:                  
Best Time to Call: AFTER 4:30P           Fax No:        (111)111-1111
Cell Phone:                              Pager No:                       
Caller Address: 330 FOO
                FOO AVENUE 123

Current Solution

For this simple example I'm successfully able to parse most fields with the function below.

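// Returns the text between the first occurrence of `start` and the first
// occurrence of `end`, converted to T. Returns default(T) when `end` is
// not found after `start`.
private T BetweenOperation<T>(string emailBody, string start, string end)
{
    var comparison = StringComparison.InvariantCulture;

    // Caveat: if `start` is missing, IndexOf returns -1 and startIndex is wrong.
    int startIndex = emailBody.IndexOf(start, comparison) + start.Length;
    // Caveat: `end` is searched from the top of the body, not from startIndex.
    int endIndex = emailBody.IndexOf(end, comparison);
    int length = endIndex - startIndex;

    if (length < 0) return default(T);

    return (T)Convert.ChangeType(
        emailBody.Substring(startIndex, length).Trim(),
        typeof(T));
}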

Essentially, my idea was that I could parse the content between two fields. For example, I could get the header code by doing

// returns "EMERGENCY"
BetweenOperation<string>("email content", "Header Code:", "Ticket No:")

This approach, however, has many flaws. One big flaw is that the end field is not always present, and when it is missing I get unpredictable results. Also, some keys contain other keys, such as "Contact" and "Secondary Contact", which don't parse quite right and cause the parser to fetch too much information. Lastly, I can extract an entire line and then pass it to BetweenOperation<T> using this.

private string LineOperation(string startWithCriteria)
{
    string[] emailLines = EmailBody.Split(new[] { '\n' });

    // First line that begins with the given key, or null if there is none.
    return emailLines.FirstOrDefault(
        emailLine => emailLine.StartsWith(startWithCriteria));
}

We would use LineOperation in some cases where the field name is not unique (e.g. Time) and feed the result to BetweenOperation<T>.

Question

How can I parse the content shown above based on keys, such as "Header Code" and "Cell Phone"? Note that I don't think parsing based on spaces or tabs will work, because some fields can be several lines long (e.g. Caller Address) or contain no value at all (e.g. Altern. Phone).

Thank you.

Mike asked Jan 28 '11 18:01


2 Answers

I would search for the fields in a specific sequence and, as each one is found, modify the email body accordingly.

Specific sequence

Contact:      CONTACT NAME          Contact Phone: (111)111-1111
Secondary Contact: SECONDARY CONTACT
Alternate Contact: 

The sequence in which to search for your fields should start with keys that are not substrings of any other key in your "Fields" (e.g. for contacts, the sequence should be "Secondary Contact:", "Alternate Contact:" and lastly "Contact:").

Modify your email body: once you have found the field information you require, remove it from the email body. Parsing in a specific sequence should ensure (I hope) that you avoid the mismatch issue, since the keys that are substrings of other keys are removed last.

Now there is also the issue of the end field. Since the end field is not always guaranteed to be there (and I am unsure whether the fields will always be in a specific order), you would have to loop through all your keys, find the index of each, and take the one closest after the start field as the end.
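
A minimal sketch of this idea, under a few assumptions: the class and method names (SequentialFieldParser, Parse) are hypothetical, each key is assumed to occur at most once, and instead of deleting a matched key from the body it is blanked out with placeholder characters so that positions in the original text stay valid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequentialFieldParser
{
    // Hypothetical helper: maps each key found in the body to its raw value.
    // Keys that are absent from the body are simply left out of the result.
    public static Dictionary<string, string> Parse(string body, IEnumerable<string> keys)
    {
        // Longest keys first, so "Secondary Contact:" is claimed before "Contact:".
        var ordered = keys.OrderByDescending(k => k.Length).ToList();

        char[] masked = body.ToCharArray();
        var found = new List<KeyValuePair<int, string>>(); // (position, key)

        foreach (string key in ordered)
        {
            int index = new string(masked).IndexOf(key, StringComparison.Ordinal);
            if (index < 0) continue; // optional field, not present

            found.Add(new KeyValuePair<int, string>(index, key));

            // Blank out the matched key (same length, so indexes do not shift)
            // to stop a shorter key from matching inside it later.
            for (int i = index; i < index + key.Length; i++) masked[i] = '\0';
        }

        // The value of a field ends where the closest following key begins.
        found.Sort((a, b) => a.Key.CompareTo(b.Key));

        var result = new Dictionary<string, string>();
        for (int i = 0; i < found.Count; i++)
        {
            int start = found[i].Key + found[i].Value.Length;
            int end = i + 1 < found.Count ? found[i + 1].Key : body.Length;
            result[found[i].Value] = body.Substring(start, end - start).Trim();
        }
        return result;
    }
}
```

Called with the three contact lines above and the keys "Contact:", "Contact Phone:", "Secondary Contact:" and "Alternate Contact:", this should give "CONTACT NAME" for "Contact:" and an empty string for "Alternate Contact:". Repeated keys such as "Time:" are not handled here; only the first occurrence is picked up.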

ChickSentMeHighE answered Oct 03 '22 15:10


One way to approach the problem would be to first search the entire text for occurrences of your keys. That is, build an array that looks like:

"Header Code:",1
"Contact Phone:",233
"Cell Phone:",-1  // not there

If you sort that array by position, then you know where to look for things. That is, you'll know which field follows each one, and so where each value ends.

You'll have to do something with duplicates (e.g. the two occurrences of "Time:" in the call dates). And you'll have to resolve "Contact:" and "Secondary Contact:", although that one should be pretty easy.
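
As a rough sketch of that index, the code below records every occurrence of every key, discards a hit that lies inside a longer hit (which is one possible way to resolve "Contact:" versus "Secondary Contact:"), and sorts what remains by position. The names KeyIndex, Hit, FindAll and Slice are made up for illustration, and keys are assumed to be non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyIndex
{
    // One occurrence of a key in the text.
    public sealed class Hit
    {
        public string Key;
        public int Start;
        public int End { get { return Start + Key.Length; } }
    }

    // Finds all occurrences of all keys, sorted by position.
    public static List<Hit> FindAll(string text, IEnumerable<string> keys)
    {
        var hits = new List<Hit>();
        foreach (string key in keys)
        {
            int at = text.IndexOf(key, StringComparison.Ordinal);
            while (at >= 0)
            {
                hits.Add(new Hit { Key = key, Start = at });
                at = text.IndexOf(key, at + key.Length, StringComparison.Ordinal);
            }
        }

        // Drop any hit fully contained in a longer one, e.g. the "Contact:"
        // that sits at the end of "Secondary Contact:".
        return hits
            .Where(h => !hits.Any(o => o.Key.Length > h.Key.Length
                                       && o.Start <= h.Start && h.End <= o.End))
            .OrderBy(h => h.Start)
            .ToList();
    }

    // Pairs each hit with the text up to the next hit (or the end of the text).
    // Duplicated keys are kept as separate entries, in document order.
    public static List<KeyValuePair<string, string>> Slice(string text, List<Hit> hits)
    {
        var fields = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < hits.Count; i++)
        {
            int end = i + 1 < hits.Count ? hits[i + 1].Start : text.Length;
            string value = text.Substring(hits[i].End, end - hits[i].End).Trim();
            fields.Add(new KeyValuePair<string, string>(hits[i].Key, value));
        }
        return fields;
    }
}
```

Because the result is an ordered list rather than a dictionary, the two "Time:" entries both survive; deciding which one belongs to which call date is then a matter of looking at the entry that precedes each of them.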

If you do this with standard string operations (i.e. IndexOf), it's going to be somewhat inefficient because you'll have to search the entire text for all occurrences of every string. Whether that's a problem for you is hard to say. Depends on how many of these you have to do.

If it becomes a problem, you'll probably want to build an Aho-Corasick string matcher, or something similar. Or you could build up a big ol' ugly regex:

"(Header Code:)|(Contact Phone:)|(Cell Phone:)" ... etc. Probably with named captures so you know what you're capturing. It should work reasonably well, although it might be difficult to maintain.

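One way the regex variant might look, as a sketch rather than a finished parser: the alternation is generated from the key list instead of being written by hand, which may ease the maintenance concern. The names RegexFieldParser and Parse are hypothetical, keys are assumed to be non-empty, and a single named group ("key") captures whichever key matched, which is a simplification of giving every key its own named capture.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexFieldParser
{
    // Returns (key, value) pairs in document order; duplicated keys are kept.
    public static List<KeyValuePair<string, string>> Parse(string text, IEnumerable<string> keys)
    {
        // Escape each key so characters such as '.' are matched literally.
        // Longer keys go first so that, when two keys share a prefix,
        // the longer alternative is tried before the shorter one.
        string alternation = string.Join("|",
            keys.OrderByDescending(k => k.Length).Select(Regex.Escape));

        var regex = new Regex("(?<key>" + alternation + ")", RegexOptions.CultureInvariant);
        MatchCollection matches = regex.Matches(text);

        var fields = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < matches.Count; i++)
        {
            Match match = matches[i];
            int valueStart = match.Index + match.Length;
            // The value runs up to the next key, or to the end of the text.
            int valueEnd = i + 1 < matches.Count ? matches[i + 1].Index : text.Length;

            fields.Add(new KeyValuePair<string, string>(
                match.Groups["key"].Value,
                text.Substring(valueStart, valueEnd - valueStart).Trim()));
        }
        return fields;
    }
}
```

The regex engine scans left to right and does not return overlapping matches, so once "Secondary Contact:" has matched, the "Contact:" inside it is not reported again. Converting each value to its target type (for instance with Convert.ChangeType, as in the question) would be a separate step on top of this.
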
Jim Mischel answered Oct 03 '22 16:10