Cleaning email chain for text analysis python

Tags:

python

text

I've got some text:

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

Which looks like this:

"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"

I'm trying to parse out the messages within it. Ultimately I'd like a list or dictionary that holds the From and To of each message along with the message body, so that I can run some analysis on it.

I've tried parsing it by lowercasing everything and then splitting the string.

text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]

Is there a better way to do this?


asked Aug 03 '18 15:08

Matt W.

3 Answers

You can use re to split the messages (an explanation of this regexp is available on an external site). The result is a list of dicts with the keys 'from', 'to', 'subject' and 'message':

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

import re
from pprint import pprint

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)

Prints:

[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
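
A possible variation on the same idea, sketched here under a few assumptions: the group names (sender, recipient, subject, message), the extra ^ anchors before To: and Subject:, and the sample string are all illustrative and not part of the answer above. Named groups let groupdict() build each dict directly instead of indexing into tuples.

```python
import re

# Hypothetical sample shaped like the text in the question.
sample = (
    "From: 'A' <a@example.com>\n"
    "To: 'B' <b@example.com>\n"
    "Subject: Hi\n"
    "\n"
    "Body one.\n"
    "\n"
    "From: 'B' <b@example.com>\n"
    "To: 'A' <a@example.com>\n"
    "Subject: RE: Hi\n"
    "\n"
    "Body two."
)

# Same structure as the pattern above, but with named groups and with
# To:/Subject: anchored to the start of a line (an added assumption).
pattern = re.compile(
    r'^From:(?P<sender>.*?)^To:(?P<recipient>.*?)^Subject:(?P<subject>.*?)$'
    r'(?P<message>.*?)(?=^From:|\Z)',
    flags=re.DOTALL | re.MULTILINE,
)

# One dict per message, with surrounding whitespace stripped from each field.
emails = [
    {key: value.strip() for key, value in match.groupdict().items()}
    for match in pattern.finditer(sample)
]

for email_record in emails:
    print(email_record)
```

Like the original pattern, this would still be confused by a message body that contains a line starting with From:, so it is only suitable for text with a predictable layout.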


answered Sep 23 '22 05:09

Andrej Kesely

That's not how str.translate works. Your text.translate(string.punctuation) uses the punctuation chars as a translation table, so it maps '\n', which is codepoint 10, to the char at index 10 of string.punctuation, which is '+'. The usual way to use str.translate is to first create a translation table using str.maketrans, which lets you specify chars to map from, the corresponding chars to map to, and (optionally) chars to delete. If you just want to use it for deletion you can create the table using dict.fromkeys, e.g.

table = dict.fromkeys([ord(c) for c in string.punctuation])

which makes a dict associating the codepoint of each char in string.punctuation to None.
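
To make the difference concrete, here is a small sketch; the sample strings are made up for illustration. It shows a plain string being indexed by codepoint when passed as the table, next to a deletion table built with dict.fromkeys.

```python
import string

# A str passed as the table is indexed by codepoint, so codepoint 10 ('\n')
# picks string.punctuation[10], which is '+'.
assert string.punctuation[10] == '+'

sample = "line one\nline two"
# Only codepoints below len(string.punctuation) are affected; other chars
# raise IndexError inside translate and are left unchanged.
mangled = sample.translate(string.punctuation)
print(repr(mangled))

# A deletion table maps the codepoint of each punctuation char to None.
table = dict.fromkeys(map(ord, string.punctuation))
cleaned = "Hello, world!".translate(table)
print(cleaned)
```

This is why splitting on '+' appeared to work in the question: the newlines had been turned into plus signs, while the punctuation itself was left in place.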

Here's a repaired version of your code that uses str.translate to perform the case conversion and the punctuation deletion in a single step.

import string

# Map upper case to lower case & remove punctuation
table = str.maketrans(string.ascii_uppercase,
    string.ascii_lowercase, string.punctuation)

text = text.translate(table)
text_list = text.split('\n')
for row in text_list:
    print(repr(row))

output

'from mark twain marktwaingmailcom'
'to edgar allen poe eapgmailcom'
'subject rehello'
''
'ed'
''
'i just read the tell tale heart youve got problems man'
''
'sincerely'
'marky mark'
''
'from edgar allen poe eapgmailcom'
'to mark twain marktwaingmailcom'
'subject re hello'
''
'mark'
''
'the world is crushing my soul and so are you'
''
'regards'
'edgar'

However, simply deleting all the punctuation is a bit messy, since it joins some words that you may not want joined. Instead, we can translate each punctuation char to a space, and then split on whitespace:

# Map all punctuation to space
# (start from the original text, not the already-stripped version above)
table = dict.fromkeys([ord(c) for c in string.punctuation], ' ')
text = text.translate(table).lower()
text_list = text.split()
print(text_list)

output

['from', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'to', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'subject', 're', 'hello', 'ed', 'i', 'just', 'read', 'the', 'tell', 'tale', 'heart', 'you', 've', 'got', 'problems', 'man', 'sincerely', 'marky', 'mark', 'from', 'edgar', 'allen', 'poe', 'eap', 'gmail', 'com', 'to', 'mark', 'twain', 'mark', 'twain', 'gmail', 'com', 'subject', 're', 'hello', 'mark', 'the', 'world', 'is', 'crushing', 'my', 'soul', 'and', 'so', 'are', 'you', 'regards', 'edgar']

answered Sep 25 '22 05:09

PM 2Ring

If all you wanted to achieve was to parse a string containing a standard-format email, then use the email.parser module; it is part of the standard library.

You'll still need to separate the emails in the larger text, but the From: ... header can help there, using a regular expression:

import re
from email import parser, policy

email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Use a separate name so the parser module is not shadowed
email_parser = parser.Parser(policy=policy.default)

for email_text in email_start.split(text):
    message = email_parser.parsestr(email_text)
    to, from_ = message['to'], message['from']
    body = message.get_payload()
    # do something with the email details

The regular expression matches a newline character that is directly preceded by another newline (so there is an empty line) and directly followed by the text From: and at least one whitespace character (so the next line looks like an email From: header).
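
As a quick illustration of that behaviour, the sketch below runs the same pattern over a made-up sample (the addresses and body text are assumptions). A From: that appears in the middle of a line does not trigger a split, because it is not preceded by an empty line.

```python
import re

email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')

# Hypothetical two-message sample; the first body mentions "From:" mid-line.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "\n"
    "I heard it From: a friend.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "\n"
    "Thanks."
)

# Only the newline of the empty line before the second From: header matches,
# so the split yields exactly one chunk per message.
chunks = email_start.split(sample)
for chunk in chunks:
    print(repr(chunk))
```

A body line that starts with From: directly after an empty line would still be treated as the start of a new message, so this heuristic depends on the input being reasonably well behaved.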

Trying to get those same parts by removing or replacing punctuation is not a very effective method of getting the same information, even when you use the tools correctly.

Demo:

>>> import re
>>> from email import parser, policy
>>> email_start = re.compile(r'(?<=\n)\n(?=From:\s+)')
>>> email_parser = parser.Parser(policy=policy.default)
>>> for email_text in email_start.split(text):
...     message = email_parser.parsestr(email_text)
...     to, from_ = message['to'], message['from']
...     body = message.get_payload()
...     print('Email from:', from_)
...     print('Email to:', to)
...     print('Third line:', body.splitlines(True)[2])
...
Email from: 'Mark Twain' <mark.twain@gmail.com>
Email to: 'Edgar Allen Poe' <eap@gmail.com>
Third line: I just read the Tell Tale Heart. You've got problems man.

Email from: 'Edgar Allen Poe' <eap@gmail.com>
Email to: 'Mark Twain' <mark.twain@gmail.com>
Third line: The world is crushing my soul, and so are you.
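
Since the question asks for a list or dictionary, one way to collect the parsed results is sketched below. The helper name split_messages, the dict keys and the sample text are assumptions made for illustration; email.message_from_string is the standard-library shortcut for creating a parser and calling parsestr.

```python
import re
from email import message_from_string, policy

EMAIL_START = re.compile(r'(?<=\n)\n(?=From:\s+)')


def split_messages(text):
    """Return one dict per message with 'from', 'to', 'subject' and 'body'."""
    records = []
    for chunk in EMAIL_START.split(text):
        msg = message_from_string(chunk, policy=policy.default)
        records.append({
            'from': str(msg.get('from', '')),
            'to': str(msg.get('to', '')),
            'subject': str(msg.get('subject', '')),
            # For a plain, non-multipart message the payload is the body text.
            'body': msg.get_payload().strip(),
        })
    return records


# Hypothetical sample in the same layout as the question's text.
sample = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First body.\n"
    "\n"
    "From: b@example.com\n"
    "To: a@example.com\n"
    "Subject: Reply\n"
    "\n"
    "Second body."
)

records = split_messages(sample)
for record in records:
    print(record)
```

Each body can then be lowercased and tokenized separately, which keeps the header fields out of the text analysis.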

answered Sep 22 '22 05:09

Martijn Pieters
