I've got some text:
text = """From: 'Mark Twain' <[email protected]>
To: 'Edgar Allen Poe' <[email protected]>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <[email protected]>
To: 'Mark Twain' <[email protected]>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
Which looks like this:
"From: 'Mark Twain' <[email protected]>\nTo: 'Edgar Allen Poe' <[email protected]>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <[email protected]>\nTo: 'Mark Twain' <[email protected]>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"
I'm trying to parse out the individual messages. Ultimately I'd like a list or dictionary holding the From and To fields and the message body, so that I can run some analysis on them.
I've tried lowercasing everything, stripping the punctuation, and then splitting the string:
text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]
Is there a better way to do this?
You can use re to split the messages (explanation of this regexp on external site). The result is a list of dicts with the keys 'from', 'to', 'subject' and 'message':
text = """From: 'Mark Twain' <[email protected]>
To: 'Edgar Allen Poe' <[email protected]>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <[email protected]>
To: 'Mark Twain' <[email protected]>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
import re
from pprint import pprint
groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    # Each match is a tuple: (from, to, subject, message)
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)
pprint(emails)
Prints:
[{'from': "'Mark Twain' <[email protected]>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <[email protected]>"},
 {'from': "'Edgar Allen Poe' <[email protected]>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <[email protected]>"}]
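As a variation on the same idea, here is a minimal sketch that uses named groups and re.finditer, so each match already carries its field names. The group names, the parse_emails helper and the example.com addresses in the sample are made up for illustration, and the extra ^ anchors in front of To: and Subject: are an assumption that every header starts its own line.

```python
import re

# Same pattern as above, with named groups and line anchors on each header.
EMAIL_RE = re.compile(
    r'^From:(?P<sender>.*?)'
    r'^To:(?P<recipient>.*?)'
    r'^Subject:(?P<subject>.*?)$'
    r'(?P<message>.*?)'
    r'(?=^From:|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)

def parse_emails(text):
    """Return one dict per message found in text (hypothetical helper)."""
    return [
        {
            'from': m.group('sender').strip(),
            'to': m.group('recipient').strip(),
            'subject': m.group('subject').strip(),
            'message': m.group('message').strip(),
        }
        for m in EMAIL_RE.finditer(text)
    ]

# Made-up sample with placeholder addresses.
sample = (
    "From: 'Mark Twain' <mark@example.com>\n"
    "To: 'Edgar Allen Poe' <edgar@example.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n\nYou've got problems man.\n\nSincerely,\nMarky Mark\n"
    "\n"
    "From: 'Edgar Allen Poe' <edgar@example.com>\n"
    "To: 'Mark Twain' <mark@example.com>\n"
    "Subject: RE: Hello!\n"
    "\n"
    "Mark,\n\nSo are you.\n\nRegards,\nEdgar"
)

emails = parse_emails(sample)
print(len(emails))  # 2
```

Note that a body line that itself begins with From: would start a new match, so this only suits text where that cannot happen.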
That's not how str.translate works. Your text.translate(string.punctuation) uses the punctuation chars as a translation table, so it maps '\n', which is codepoint 10, to the char at index 10 of string.punctuation, which is '+'. The usual way to use str.translate is to first create a translation table using str.maketrans, which lets you specify chars to map from, the corresponding chars to map to, and (optionally) chars to delete. If you just want to use it for deletion you can create the table using dict.fromkeys, e.g.
table = dict.fromkeys([ord(c) for c in string.punctuation])
which makes a dict associating the codepoint of each char in string.punctuation with None.
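To make the difference concrete, here is a small sketch comparing the two calls on a made-up sample string (the variable names are only for illustration):

```python
import string

sample = "Hi!\nBye?"

# Passing the raw string as the "table" indexes it by codepoint:
# '\n' is codepoint 10, and string.punctuation[10] is '+'.
# Codepoints past the end of the table are left unchanged,
# so '!' and '?' survive.
misused = sample.translate(string.punctuation)

# A real table: map every punctuation codepoint to None to delete it.
delete_table = dict.fromkeys(map(ord, string.punctuation))
cleaned = sample.translate(delete_table)

print(repr(misused))  # 'Hi!+Bye?'
print(repr(cleaned))  # 'Hi\nBye'
```

This is why splitting on '+' appeared to work in the question: the newlines had been turned into plus signs, while the punctuation itself was never removed.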
Here's a repaired version of your code that uses str.translate to perform the case conversion and the punctuation deletion in a single step.
import string

# Map upper case to lower case & remove punctuation
table = str.maketrans(string.ascii_uppercase,
    string.ascii_lowercase, string.punctuation)
text = text.translate(table)
text_list = text.split('\n')
for row in text_list:
    print(repr(row))
output
'from mark twain marktwaingmailcom'
'to edgar allen poe eapgmailcom'
'subject rehello'
''
'ed'
''
'i just read the tell tale heart youve got problems man'
''
'sincerely'
'marky mark'
''
'from edgar allen poe eapgmailcom'
'to mark twain marktwaingmailcom'
'subject re hello'
''
'mark'
''
'the world is crushing my soul and so are you'
''
'regards'
'edgar'
However, simply deleting all the punctuation is a bit messy, since it joins some words that you may not want joined. Instead, we can translate each punctuation char to a space, and then split on whitespace:
# Map all punctuation to space
table = dict.fromkeys([ord(c) for c in string.punctuation], ' ')
text = text.translate(table).lower()
text_list = text.split()
print(text_list)
output
['from', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'to', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'subject', 're', 'hello', 'ed', 'i', 'just', 'read', 'the', 'tell', 'tale', 'heart', 'you', 've', 'got', 'problems', 'man', 'sincerely', 'marky', 'mark', 'from', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'to', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'subject', 're', 'hello', 'mark', 'the', 'world', 'is', 'crushing', 'my', 'soul', 'and', 'so', 'are', 'you', 'regards', 'edgar']
If all you wanted to achieve was to parse a string containing a standard-format email, then use the email.parser module; it is part of the standard library.
You'll still need to separate the emails in the larger text, but the From: ... header can help there, using a regular expression:
import re
from email import parser, policy

# Split on the newline that follows an empty line and precedes a From: header
email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
parser = parser.Parser(policy=policy.default)
for email_text in email_start.split(text):
    message = parser.parsestr(email_text)
    to, from_ = message['to'], message['from']
    body = message.get_payload()
    # do something with the email details
The regular expression matches any newline character that is directly preceded by another newline (so there is an empty line), followed by the text From: and at least one space (so the next line looks like an email From: header).
Removing or replacing punctuation is not a very effective way of getting at those same parts, even when you use the tools correctly.
Demo:
>>> import re
>>> from email import parser, policy
>>> email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
>>> parser = parser.Parser(policy=policy.default)
>>> for email_text in email_start.split(text):
...     message = parser.parsestr(email_text)
...     to, from_ = message['to'], message['from']
...     body = message.get_payload()
...     print('Email from:', from_)
...     print('Email to:', to)
...     print('Third line:', body.splitlines(True)[2])
...
Email from: 'Mark Twain' <[email protected]>
Email to: 'Edgar Allen Poe' <[email protected]>
Third line: I just read the Tell Tale Heart. You've got problems man.
Email from: 'Edgar Allen Poe' <[email protected]>
Email to: 'Mark Twain' <[email protected]>
Third line: The world is crushing my soul, and so are you.
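Since the question asked for a list of dictionaries, the loop above could be wrapped up roughly like this. This is only a sketch: the split_emails helper, the dictionary keys and the example.com addresses in the sample are made up for illustration, and it assumes every message is plain text separated by an empty line before the next From: header.

```python
import re
from email import policy
from email.parser import Parser

# Same separator as above: the newline after an empty line, before "From:".
EMAIL_START = re.compile(r'(?<=\n)\n(?=From:\s+)')

def split_emails(text):
    """Return a list of dicts, one per email in text (hypothetical helper)."""
    email_parser = Parser(policy=policy.default)
    records = []
    for chunk in EMAIL_START.split(text):
        message = email_parser.parsestr(chunk)
        records.append({
            'from': str(message['from']),
            'to': str(message['to']),
            'subject': str(message['subject']),
            # get_payload() returns the body as a str for plain-text messages
            'body': message.get_payload().strip(),
        })
    return records

# Made-up sample with placeholder addresses.
sample = (
    "From: 'Mark Twain' <mark@example.com>\n"
    "To: 'Edgar Allen Poe' <edgar@example.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n\nYou've got problems man.\n\nSincerely,\nMarky Mark\n"
    "\n"
    "From: 'Edgar Allen Poe' <edgar@example.com>\n"
    "To: 'Mark Twain' <mark@example.com>\n"
    "Subject: RE: Hello!\n"
    "\n"
    "Mark,\n\nSo are you.\n\nRegards,\nEdgar"
)

records = split_emails(sample)
for record in records:
    print(record['from'], '->', record['to'])
```

The bodies can then be fed to whatever analysis you have in mind without the headers getting mixed in.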