Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<[email protected]>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25, delay=5.4, delays=0.02/3.2/0.97/1.1, dsn=5.0.0, status=bounced (host mta5.am0.yahoodns.net[98.138.112.35] said: 554 delivery error: dd This user doesn't have a yahoo.com account ([email protected]) [0] - mta1321.mail.ne1.yahoo.com (in reply to end of DATA command))
Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<[email protected]>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25, delay=5.9, delays=0.01/3.1/0.99/1.8, dsn=5.0.0, status=bounced (host mta7.am0.yahoodns.net[98.136.217.202] said: 554 delivery error: dd This user doesn't have a yahoo.com account ([email protected]) [0] - mta1397.mail.gq1.yahoo.com (in reply to end of DATA command))
From the above maillog I would like to extract the email addresses enclosed in angle brackets, < ... >;
e.g. from to=<[email protected]>
to [email protected]
I am using cut -d' ' -f7 to extract the emails, but I am curious whether there is a more flexible way.
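For what it's worth, cut itself can be made less position-dependent by keying off the angle brackets instead of counting space-separated fields. A sketch, assuming (as in the sample) that to=<...> is the only angle-bracketed field on the line; "maillog" is a placeholder filename:

```shell
# Create a one-line sample in the question's format (maillog is a placeholder name).
printf 'Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<[email protected]>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25\n' > maillog

# Split on '<' to drop everything before the address, then on '>' to drop the rest.
cut -d'<' -f2 maillog | cut -d'>' -f1
```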
With GNU grep, just use a regular expression containing a look-behind and a look-ahead:
$ grep -Po '(?<=to=<).*(?=>)' file
[email protected]
[email protected]
This says: extract every string preceded by to=< and followed by >.
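One caveat: .* is greedy, so if a line ever contained a second >, the match would run to the last one; [^>]* stops at the first > instead. A small sketch (the relay=<mta> field is invented, purely to force a second > onto the line):

```shell
# Invented input: relay=<mta> adds a second '>' after the address.
printf 'to=<[email protected]>, relay=<mta>\n' > sample.log

grep -Po '(?<=to=<).*(?=>)' sample.log   # greedy: matches up to the LAST '>'
grep -Po '(?<=to=<)[^>]*'   sample.log   # stops at the first '>'
```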
You can use awk like this:
awk -F'to=<|>,' '{print $2}' the.log
This splits the line on to=< or >, and prints the second field.
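Note that if the log contains lines without a to=<...> field, this prints an empty line for each of them; guarding on $2 skips such lines. A sketch, using an invented qmgr line as the non-matching input:

```shell
# the.log mixes a matching line with one that has no "to=<...>" field.
printf 'Jan 23 00:46:24 portal postfix/qmgr[123]: 1B1653FEA1: removed\nJan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<[email protected]>, relay=x\n' > the.log

# On non-matching lines $2 is empty, so the guard skips them.
awk -F'to=<|>,' '$2 != "" {print $2}' the.log
```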
Just to show a sed alternative (requires GNU or BSD/macOS sed, due to -E):
sed -E 's/.* to=<(.*)>.*/\1/' file
Note how the regex must match the entire line so that the substitution of the capture-group match (the email address) yields only that match.
A slightly more efficient - but perhaps less readable - variation is:
sed -E 's/.* to=<([^>]*).*/\1/' file
A POSIX-compliant formulation is a little more cumbersome due to the legacy syntax required by BREs (basic regular expressions):
sed 's/.* to=<\(.*\)>.*/\1/' file
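If the file can contain lines without a to=<...> field, the substitution leaves them untouched and sed prints them unmodified; combining -n with the p flag prints only the lines where the substitution succeeded. A sketch using the same POSIX BRE, with an invented qmgr line as the non-matching input:

```shell
# file mixes a matching line with one that has no "to=<...>" field.
printf 'Jan 23 00:46:24 portal postfix/qmgr[123]: 1B1653FEA1: removed\nJan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<[email protected]>, relay=x\n' > file

# -n suppresses default output; the p flag prints only lines where the
# substitution actually happened.
sed -n 's/.* to=<\(.*\)>.*/\1/p' file
```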
A variation of fedorqui's helpful GNU grep answer:
grep -Po ' to=<\K[^>]*' file
\K, which drops everything matched up to that point, is not only syntactically simpler than a look-behind assertion ((?<=...)), but also more flexible - it supports variable-length expressions - and faster (though that may not matter in many real-world situations; if performance is paramount, see below).
Here's how the various solutions on this page compare in terms of performance.
Note that performance may not matter much in many use cases; the absolute values below are not important, but the relative performance hopefully provides some insight.
See the bottom for the script that produced these timings, which were obtained on a late-2012 27" iMac running macOS 10.12.3, using a 250,000-line input file created by replicating the sample input from the question, averaging the timings of 10 runs each.
Mawk 0.364s
GNU grep, \K, non-backtracking 0.392s
GNU awk 0.830s
GNU grep, \K 0.937s
GNU grep, (?<=...) 1.639s
BSD grep + cut 2.733s
GNU grep + cut 3.697s
BSD awk 3.785s
BSD sed, non-backtracking 7.825s
BSD sed 8.414s
GNU sed 16.738s
GNU sed, non-backtracking 17.387s
A few conclusions:
grep is generally a good choice, even if it needs to be combined with cut.
sed is surprisingly slow, whereas GNU awk is faster than BSD awk. Strangely, the (partially) non-backtracking solution is slower with GNU sed.
Here's the script that produced the timings above; note that the g-prefixed commands are GNU utilities that were installed on macOS via Homebrew; similarly, mawk was installed via Homebrew.
Note that "non-backtracking" only applies partially to some of the commands.
#!/usr/bin/env bash
# Define the test commands.
test01=( 'BSD sed' sed -E 's/.*to=<(.*)>.*/\1/' )
test02=( 'BSD sed, non-backtracking' sed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test03=( 'GNU sed' gsed -E 's/.*to=<(.*)>.*/\1/' )
test04=( 'GNU sed, non-backtracking' gsed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test05=( 'BSD awk' awk -F' to=<|>,' '{print $2}' )
test06=( 'GNU awk' gawk -F' to=<|>,' '{print $2}' )
test07=( 'Mawk' mawk -F' to=<|>,' '{print $2}' )
#--
test08=( 'GNU grep, (?<=...)' ggrep -Po '(?<= to=<).*(?=>)' )
test09=( 'GNU grep, \K' ggrep -Po ' to=<\K.*(?=>)' )
test10=( 'GNU grep, \K, non-backtracking' ggrep -Po ' to=<\K[^>]*' )
# --
test11=( 'BSD grep + cut' "{ grep -o ' to=<[^>]*' | cut -d'<' -f2; }" )
test12=( 'GNU grep + cut' "{ ggrep -o ' to=<[^>]*' | gcut -d'<' -f2; }" )
# Determine input and output files.
inFile='file'
# NOTE: Do NOT use /dev/null, because GNU grep apparently takes a shortcut
# when it detects stdout going nowhere, which distorts the timings.
# Use /dev/tty if you want to see stdout in the terminal (it will print
# as a single block across all tests before the results are reported).
outFile="/tmp/out.$$"
# outFile='/dev/tty'
# Make `time` only report the overall elapsed time.
TIMEFORMAT='%6R'
# How many runs per test whose timings to average.
runs=10
# Read the input file once up front to even the playing field, so that the
# first command doesn't take the hit of being the first to load the file from disk.
echo "Warming up the cache..."
cat "$inFile" >/dev/null
# Run the tests.
echo "Running $(awk '{print NF}' <<<"${!test*}") test(s), averaging the timings of $runs run(s) each; this may take a while..."
{
  for n in ${!test*}; do
    arrRef="$n[@]"
    test=( "${!arrRef}" )
    # Print test description.
    printf '%s\t' "${test[0]}"
    # Execute test command. Note: the timing loop uses a distinct variable, i,
    # so as not to clobber the outer loop's n.
    if (( ${#test[@]} == 2 )); then # single-token command? assume `eval` must be used.
      time for (( i = 0; i < runs; i++ )); do eval "${test[@]: 1}" < "$inFile" >"$outFile"; done
    else # multiple command tokens? assume that they form a simple command that can be invoked directly.
      time for (( i = 0; i < runs; i++ )); do "${test[@]: 1}" "$inFile" >"$outFile"; done
    fi
  done
} 2>&1 |
  sort -t$'\t' -k2,2n |
  awk -v runs="$runs" '
    BEGIN { FS=OFS="\t" } { avg = sprintf("%.3f", $2/runs); print $1, avg "s" }
  ' | column -s$'\t' -t