I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text>
etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ...
" would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindal wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers from common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains you may get a fairly good indication.
E.G. A line with greeting words near the start is the salutation (also salutations may have phrases that refer to the past tense e.g. it was good to see you last time)
A Body will typically contain words such as "movie, concert" etc. It will also contain verbs (go to, run, walk, etc) and questions marks and offerings (e.g. want to, can we, should we, prefer..). Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation http://ogden.basic-english.org/ http://osteele.com/projects/pywordnet/
the signature will contain closing words.
If you find a datasource that has messages of the structure you want you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score [salutation score, body score, signature score,..] e.g. hello could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature. this means hello would get assigned [900, 10, 3, ..] cheers might get assigned [10,3,100,..]
now you will have a large list of about 500,000 words. words that don't have a large range aren't useful. e.g. catch might have [100,101,80..] = range of 21 (it was good to catch up, wanna go catch a fish, catch you later). catch can occur anywhere.
Now you can reduce the number of words down to about 10,000
now for each line, give the line a score also of the form [salutation score, body score, signature score,..]
this score is calculated by adding the vector scores of each word.
e.g. a sentence "hello cheers for giving me your number" could be: [900, 10, 3, ..] + [10,3,100,..] + .. + .. + = [900+10+..,10+3+..,3+100,..] =[1023,900,500,..] say
then because the biggest number is at the start in the salutation score position, this sentence is a salutation.
then if you had to score one of your lines to see what component the line should be in, for each word you would add on its score
Good luck, there is always a trade-off between computation complexity and accuracy. If you can find a good set of words and make a good model to base you calculations it will help.
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. here's a little bit of code and stuff
linearray=emailtext.split('\n') now you have an array of strings, each one like a paragraph or whatever
so linearray[0] would contain the salutation
deciding where the reply text starts is a little more tricky, i noticed that there is a double newline just before it so maybe do a search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.
Once you figure out where the signature is the rest is the rest is easy
hope this helped
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With