
Cleaning email chain for text analysis python

Tags:

python

text

I've got some text:

text = """From: 'Mark Twain' <[email protected]>
To: 'Edgar Allen Poe' <[email protected]>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <[email protected]>
To: 'Mark Twain' <[email protected]>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

Which looks like this:

"From: 'Mark Twain' <[email protected]>\nTo: 'Edgar Allen Poe' <[email protected]>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <[email protected]>\nTo: 'Mark Twain' <[email protected]>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"

I'm trying to parse out the individual messages. Ultimately I'd like a list or dictionary that holds the From and To of each message, plus the message body, so that I can run some analysis on it.

I've tried parsing it by lowercasing everything and then splitting the string:

text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]

Is there a better way to do this?

Matt W. asked Aug 03 '18 15:08




3 Answers

You can use re to split the messages with a single regular expression. The result is a list of dicts with the keys 'from', 'to', 'subject' and 'message':

text = """From: 'Mark Twain' <[email protected]>
To: 'Edgar Allen Poe' <[email protected]>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <[email protected]>
To: 'Mark Twain' <[email protected]>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

import re
from pprint import pprint

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)

Prints:

[{'from': "'Mark Twain' <[email protected]>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <[email protected]>"},
 {'from': "'Edgar Allen Poe' <[email protected]>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <[email protected]>"}]
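The same idea can be written with named groups and re.VERBOSE, which makes each part of the pattern easier to read. This is only a sketch: the sample text and its addresses are made up, the group names (sender, recipient, subject, message) are my own choice, and I have anchored To: and Subject: to the start of a line, which the pattern above does not do.

```python
import re

# Made-up sample in the same shape as the question's text;
# the addresses are placeholders.
sample = """From: 'Mark Twain' <mark@example.com>
To: 'Edgar Allen Poe' <edgar@example.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart.

From: 'Edgar Allen Poe' <edgar@example.com>
To: 'Mark Twain' <mark@example.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul.
"""

EMAIL_RE = re.compile(
    r"""
    ^From:(?P<sender>.*?)        # sender, lazily, up to the next header
    ^To:(?P<recipient>.*?)       # recipient
    ^Subject:(?P<subject>.*?)$   # subject, up to the end of its line
    (?P<message>.*?)             # body, as short as possible
    (?=^From:|\Z)                # stop before the next From: line or at the end
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)

# One dict per message, with surrounding whitespace stripped from each field.
emails = [
    {name: value.strip() for name, value in match.groupdict().items()}
    for match in EMAIL_RE.finditer(sample)
]

for email in emails:
    print(email['sender'], '->', email['recipient'])
```

Note that any body line that itself starts with From: would be treated as the start of a new message, so this only suits text where that cannot happen.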
Andrej Kesely answered Sep 23 '22 05:09


That's not how str.translate works. Your text.translate(string.punctuation) uses the punctuation string itself as the translation table, so it maps '\n' (codepoint 10) to the character at index 10 of string.punctuation, which is '+'. Characters whose codepoint is beyond the end of that string are left unchanged. The usual way to use str.translate is to first create a translation table with str.maketrans, which lets you specify the chars to map from, the corresponding chars to map to, and (optionally) chars to delete. If you just want to use it for deletion, you can create the table using dict.fromkeys, e.g.

table = dict.fromkeys([ord(c) for c in string.punctuation])

which makes a dict associating the codepoint of each char in string.punctuation to None.
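To see the two behaviours side by side, here is a minimal sketch. The string sample is made up for illustration and is not the email text from the question.

```python
import string

sample = "Hello,\tworld!\nBye."  # made-up sample

# A plain string used as the table: translate() looks up table[ord(ch)].
# string.punctuation has 32 chars, so only codepoints 0..31 find an entry;
# here the tab (9) becomes '*' and the newline (10) becomes '+'.
broken = sample.translate(string.punctuation)
print(repr(broken))

# A proper deletion table: every punctuation codepoint maps to None.
table = dict.fromkeys([ord(c) for c in string.punctuation])
cleaned = sample.translate(table)
print(repr(cleaned))
```

The first call leaves the comma, '!' and '.' untouched, while the second removes them and keeps the tab and newline.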

Here's a repaired version of your code that uses str.translate to perform the case conversion and the punctuation deletion in a single step.

import string

# Map upper case to lower case & remove punctuation
table = str.maketrans(string.ascii_uppercase,
    string.ascii_lowercase, string.punctuation)

text = text.translate(table)
text_list = text.split('\n')
for row in text_list:
    print(repr(row))

output

'from mark twain marktwaingmailcom'
'to edgar allen poe eapgmailcom'
'subject rehello'
''
'ed'
''
'i just read the tell tale heart youve got problems man'
''
'sincerely'
'marky mark'
''
'from edgar allen poe eapgmailcom'
'to mark twain marktwaingmailcom'
'subject re hello'
''
'mark'
''
'the world is crushing my soul and so are you'
''
'regards'
'edgar'

However, simply deleting all the punctuation is a bit messy, since it joins some words that you may not want joined. Instead, we can translate each punctuation char to a space, and then split on whitespace:

# Map all punctuation to space
table = dict.fromkeys([ord(c) for c in string.punctuation], ' ')
text = text.translate(table).lower()
text_list = text.split()
print(text_list)

output

['from', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'to', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'subject', 're', 'hello', 'ed', 'i', 'just', 'read', 'the', 'tell', 'tale', 'heart', 'you', 've', 'got', 'problems', 'man', 'sincerely', 'marky', 'mark', 'from', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'to', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'subject', 're', 'hello', 'mark', 'the', 'world', 'is', 'crushing', 'my', 'soul', 'and', 'so', 'are', 'you', 'regards', 'edgar']
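As the output shows, mapping every punctuation char to a space also splits contractions such as "you've" into 'you' and 've'. If that matters for your analysis, one option is to leave the apostrophe out of the table. A small sketch, using a made-up sample string:

```python
import string

sample = "You've got problems, man. It's <not> great!"  # made-up sample

# Map every punctuation char except the apostrophe to a space.
to_space = string.punctuation.replace("'", "")
table = str.maketrans(to_space, " " * len(to_space))

tokens = sample.lower().translate(table).split()
print(tokens)
# ["you've", 'got', 'problems', 'man', "it's", 'not', 'great']
```

The trade-off is that quoted names like 'Mark Twain' keep their quote characters attached to the first and last word, so you may want to call .strip("'") on each token afterwards.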
PM 2Ring answered Sep 25 '22 05:09


If all you wanted to achieve was to parse a string containing a standard-format email, then use the email.parser module; it is part of the standard library.

You'll still need to separate the emails in the larger text, but the From: ... header can help there, using a regular expression:

import re
from email import policy
from email.parser import Parser

# Matches the empty line that precedes each 'From:' header
email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Use a distinct name so the email.parser module is not shadowed
email_parser = Parser(policy=policy.default)

for email_text in email_start.split(text):
    message = email_parser.parsestr(email_text)
    to, from_ = message['to'], message['from']
    body = message.get_payload()
    # do something with the email details

The regular expression matches any newline character that is directly preceded by another newline (so there is an empty line), followed by the text From: and at least one space (so the next line looks like an email From: header).
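To make the effect of the split concrete, here is a small sketch on a made-up two-message string (the addresses are placeholders). Only the newline that forms the empty line before the second From: is consumed, so every chunk starts with its own From: header.

```python
import re

email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Made-up two-message sample; the addresses are placeholders.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "\n"
    "First body.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "\n"
    "Second body.\n"
)

chunks = email_start.split(sample)
for chunk in chunks:
    print(repr(chunk))
```

The empty line between the headers and the body is not a split point, because it is not followed by From:.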

Removing or replacing punctuation is not an effective way to extract those parts, even when you use the tools correctly.

Demo:

>>> import re
>>> from email import policy
>>> from email.parser import Parser
>>> email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
>>> email_parser = Parser(policy=policy.default)
>>> for email_text in email_start.split(text):
...     message = email_parser.parsestr(email_text)
...     to, from_ = message['to'], message['from']
...     body = message.get_payload()
...     print('Email from:', from_)
...     print('Email to:', to)
...     print('Third line:', body.splitlines(True)[2])
...
Email from: 'Mark Twain' <[email protected]>
Email to: 'Edgar Allen Poe' <[email protected]>
Third line: I just read the Tell Tale Heart. You've got problems man.

Email from: 'Edgar Allen Poe' <[email protected]>
Email to: 'Mark Twain' <[email protected]>
Third line: The world is crushing my soul, and so are you.
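If you want the list of dicts the question asks for, the loop can be wrapped in a small function. This is a sketch under a few assumptions: split_emails is a hypothetical helper name, the sample text and its addresses are made up, and each chunk is assumed to be a simple non-multipart message so that get_payload() returns a string.

```python
import re
from email import policy
from email.parser import Parser

# Matches the empty line that precedes each 'From:' header
EMAIL_START = re.compile(r'(?<=\n)\n(?=From:\s+)')


def split_emails(text):
    """Return one dict per message found in text (hypothetical helper)."""
    email_parser = Parser(policy=policy.default)
    records = []
    for chunk in EMAIL_START.split(text):
        message = email_parser.parsestr(chunk)
        records.append({
            # str() turns the header objects into plain strings;
            # a missing header becomes an empty string.
            'from': str(message.get('from', '')),
            'to': str(message.get('to', '')),
            'subject': str(message.get('subject', '')),
            'body': message.get_payload().strip(),
        })
    return records


# Made-up two-message sample; the addresses are placeholders.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Ed,\n"
    "\n"
    "First body.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Second body.\n"
)

records = split_emails(sample)
for record in records:
    print(record['from'], '->', record['to'], '|', record['body'])
```

Each record's 'body' can then go through whatever text cleaning you need, without the header lines polluting the analysis.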
Martijn Pieters answered Sep 22 '22 05:09