Given an email subject line, I'd like to clean it up by stripping "Re:", "Fwd", and similar junk. For example, "[Fwd] Re: Jack and Jill's Wedding" should turn into "Jack and Jill's Wedding".
Someone must have done this before, so I'm hoping you can point me to a battle-tested regex or code.
Here are some examples of what needs to be cleaned up, found on this page. The regex on that page works fairly well but does not handle every case.
Fwd : Re : Re: Many
Re : Re: Many
Re : : Re: Many
Re:: Many
Re; Many
: noah - should not match anything
RE--
RE: : Presidential Ballots for Florida
[RE: (no subject)]
Request - should not match anything
this is the subject (fwd)
Re: [Fwd: ] Blonde Joke
Re: [Fwd: [Fwd: FW: Policy]]
Re: Fwd: [Fwd: FW: "Drink Plenty of Water"]
FW: FW: (fwd) FW: Warning from XYZ...
FW: (Fwd) (Fwd)
Fwd: [Fwd: [Fwd: Big, Bad Surf Moving]]
FW: [Fwd: Fw: drawing by a school age child in PA (fwd)]
Re: Fwd
Try this one (replace with ''):
/([\[\(] *)?(RE|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$/igm
(If you put each subject through as its own string, you don't need the m modifier; it is only there so that $ matches end of line, not just end of string, for multiline string inputs.)
See it in action here.
Explanation of regex:
([\[\(] *)?             # starting [ or (, followed by optional spaces
(RE|FWD?) *             # RE or FW or FWD, followed by optional spaces
([-:;)\]][ :;\])-]*|$)  # only count it as a Re or FWD if it is followed by
                        # : or - or ; or ] or ) or end of line
                        # (and after that you can have more of these symbols with
                        # spaces in between)
|                       # OR
\]+ *$                  # match any trailing \] at end of line
                        # (we assume the brackets () occur around a whole Re/Fwd
                        # but the square brackets [] occur around the whole
                        # subject line)
Flags:
i: case insensitive.
g: global match (match all the Re/Fwd you can find).
m: let the '$' in the regex match end of line for a multiline input, not just end of string. This is only relevant if you feed all your subjects to the regex at once; if you feed in one subject at a time you can remove it, because end of line is then end of string.
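For anyone working in Python rather than JavaScript, here is a minimal sketch of the same replacement. It assumes each subject is passed in as its own string, so the m flag is dropped; the names PREFIX_RE and clean_subject are made up for this example and are not part of the answer above.

```python
import re

# Same pattern as above, written as a Python raw string.
# The i flag becomes re.IGNORECASE; the g flag is implied because
# re.sub replaces every match it finds.
PREFIX_RE = re.compile(
    r"([\[\(] *)?(RE|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$",
    re.IGNORECASE,
)

def clean_subject(subject):
    """Remove Re/Fw/Fwd markers, then trim leftover whitespace."""
    return PREFIX_RE.sub("", subject).strip()

if __name__ == "__main__":
    for raw in [
        "Fwd : Re : Re: Many",
        "Request",
        "Re: [Fwd: [Fwd: FW: Policy]]",
        "this is the subject (fwd)",
    ]:
        print(repr(raw), "->", repr(clean_subject(raw)))
```

Note that the pattern is not anchored to word boundaries, so a subject such as "Hardware: update" loses the "re: " in the middle of the word. If that matters for your data, you would need to tighten the pattern.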
The subject prefix varies by country and language; see Wikipedia: List of email subject abbreviations.
For example, in Brazil RES is used in place of RE, and in German AW is used in place of RE.
Example in Python:
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import re
# Raw string so the backslashes reach the regex engine unchanged.
p = re.compile(r'([\[\(] *)?(RE?S?|FYI|RIF|I|FS|VB|RV|ENC|ODP|PD|YNT|ILT|SV|VS|VL|AW|WG|ΑΠ|ΣΧΕΤ|ΠΡΘ|תגובה|הועבר|主题|转发|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$', re.IGNORECASE)
# Python 3: print is a function.
print(p.sub('', 'RE: Tagon8 Inc.').strip())
Example in PHP:
<?php
$subject = "主题: Tagon8 - test php";
// The u modifier makes PCRE treat the pattern and the subject as UTF-8.
$subject = preg_replace("/([\[\(] *)?(RE?S?|FYI|RIF|I|FS|VB|RV|ENC|ODP|PD|YNT|ILT|SV|VS|VL|AW|WG|ΑΠ|ΣΧΕΤ|ΠΡΘ|תגובה|הועבר|主题|转发|FWD?) *([-:;)\]][ :;\])-]*|$)|\]+ *$/imu", '', $subject);
var_dump(trim($subject));
Terminal:
$ python test.py
Tagon8 Inc.
$ php test.php
string(17) "Tagon8 - test php"
Note: This is mathematical.coffee's regular expression, with prefixes added from other languages: Chinese, Danish, Norwegian, Finnish, French, German, Greek, Hebrew, Italian, Icelandic, Swedish, Portuguese, Polish, Turkish.
I used "strip/trim" to remove the leftover spaces.
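If the prefix list keeps growing, it may be easier to keep it as data and build the alternation from it. This is only a sketch: PREFIXES is a small illustrative subset (not the full list above), and build_prefix_regex and clean_subject are hypothetical names.

```python
import re

# Illustrative subset of localized reply/forward prefixes (an assumption
# made for this example; extend it with the prefixes you actually see).
PREFIXES = ["RES", "RE", "FWD", "FW", "AW", "WG", "SV", "VS", "主题", "转发"]

def build_prefix_regex(prefixes):
    """Build the same pattern shape as above from a list of prefixes."""
    # re.escape guards against prefixes that contain regex metacharacters.
    alternation = "|".join(re.escape(p) for p in prefixes)
    return re.compile(
        r"([\[\(] *)?(" + alternation + r") *([-:;)\]][ :;\])-]*|$)|\]+ *$",
        re.IGNORECASE,
    )

PREFIX_RE = build_prefix_regex(PREFIXES)

def clean_subject(subject):
    """Remove every listed prefix, then trim leftover whitespace."""
    return PREFIX_RE.sub("", subject).strip()

if __name__ == "__main__":
    for raw in ["AW: WG: Termin", "RES: Reunião", "主题: Tagon8 - test php"]:
        print(repr(clean_subject(raw)))
```

Adding a language then means appending to the list rather than editing the regex by hand.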
The following regex matches all of the cases the way I would expect. You may not agree, because the expected result has not been explicitly documented for every case. It can almost certainly be simplified, but it is functional:
/^((\[(re|fw(d)?)\s*\]|[\[]?(re|fw(d)?))\s*[\:\;]\s*([\]]\s?)*|\(fw(d)?\)\s*)*([^\[\]]*)[\]]*/i
The last capture group in the match will be the stripped subject.
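As a rough illustration of reading the subject out of that last group, here is a Python sketch. SUBJECT_RE and stripped_subject are names invented for this example, and the extra strip() call is my addition.

```python
import re

# The regex from above as a Python raw string; the i flag becomes
# re.IGNORECASE.
SUBJECT_RE = re.compile(
    r"^((\[(re|fw(d)?)\s*\]|[\[]?(re|fw(d)?))\s*[\:\;]\s*([\]]\s?)*"
    r"|\(fw(d)?\)\s*)*([^\[\]]*)[\]]*",
    re.IGNORECASE,
)

def stripped_subject(subject):
    """Return the text captured by the last group of the pattern."""
    # Every part of the pattern is optional, so match() always succeeds.
    match = SUBJECT_RE.match(subject)
    # SUBJECT_RE.groups is the number of capture groups, i.e. the index
    # of the last one.
    return match.group(SUBJECT_RE.groups).strip()

if __name__ == "__main__":
    for raw in [
        "Re: [Fwd: ] Blonde Joke",
        "Fwd: [Fwd: [Fwd: Big, Bad Surf Moving]]",
        "FW: FW: (fwd) FW: Warning from XYZ...",
        "Request",
    ]:
        print(repr(stripped_subject(raw)))
```

Unlike the replace-based approach, this one only strips markers at the start of the subject, so a trailing "(fwd)" is left in place.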