Extract email addresses from log with grep or sed

Tags: regex, grep, sed, awk, cut

Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<[email protected]>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25, delay=5.4, delays=0.02/3.2/0.97/1.1, dsn=5.0.0, status=bounced (host mta5.am0.yahoodns.net[98.138.112.35] said: 554 delivery error: dd This user doesn't have a yahoo.com account ([email protected]) [0] - mta1321.mail.ne1.yahoo.com (in reply to end of DATA command))
Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<[email protected]>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25, delay=5.9, delays=0.01/3.1/0.99/1.8, dsn=5.0.0, status=bounced (host mta7.am0.yahoodns.net[98.136.217.202] said: 554 delivery error: dd This user doesn't have a yahoo.com account ([email protected]) [0] - mta1397.mail.gq1.yahoo.com (in reply to end of DATA command))

From the maillog above I would like to extract the email addresses enclosed in angle brackets < ... >, e.g. to=<[email protected]> becomes [email protected]

I am using cut -d' ' -f7 to extract the emails, but I am curious whether there is a more flexible way.
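For comparison, here is what the cut-based approach looks like once the to=< ... > wrapper is also stripped (a sketch; it relies on the address always being the 7th space-separated field, which is exactly what makes it fragile):

# Field 7 is "to=<...>,"; a second step must trim the wrapper.
cut -d' ' -f7 file | sed 's/^to=<//; s/>,\{0,1\}$//'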

Asked Jan 26 '17 by sherpaurgen




3 Answers

With GNU grep, just use a regular expression containing a lookbehind and a lookahead:

$ grep -Po '(?<=to=<).*(?=>)' file
[email protected]
[email protected]

This says: hey, extract all the strings preceded by to=< and followed by >.
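Note that -P is a GNU grep feature (available only in builds compiled with PCRE support); stock BSD/macOS grep doesn't have it. As a portable fallback, here is a sketch of the grep + cut pipeline that also shows up in the benchmark further below:

# Isolate the "to=<address" token, then cut off everything up to the "<".
grep -o 'to=<[^>]*' file | cut -d'<' -f2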

Answered by fedorqui 'SO stop harming'


You can use awk like this:

awk -F'to=<|>,' '{print $2}' the.log

I'm splitting the line by to=< or >, and printing the second field.
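If the log also contains lines without a to=< ... > token, the split yields only one field there and $2 prints as an empty line; a guard on the field count (a sketch) avoids that:

# Print the address only on lines that actually split into multiple fields.
awk -F'to=<|>,' 'NF > 1 {print $2}' the.log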

Answered by hek2mgl


Just to show a sed alternative (requires GNU or BSD/macOS sed due to -E):

sed -E 's/.* to=<(.*)>.*/\1/' file

Note how the regex must match the entire line so that the substitution of the capture-group match (the email address) yields only that match.

A slightly more efficient - but perhaps less readable - variation is
sed -E 's/.* to=<([^>]*).*/\1/' file


A POSIX-compliant formulation is a little more cumbersome due to the legacy syntax required by BREs (basic regular expressions):

sed 's/.* to=<\(.*\)>.*/\1/' file
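One caveat that applies to all of these sed variants: lines that don't contain to=< pass through unmodified, because the substitution simply fails to match. If that's a concern, a common refinement (sketched here on the -E form) is to suppress default output and print only when a substitution occurred:

sed -nE 's/.* to=<([^>]*).*/\1/p' file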

A variation of fedorqui's helpful GNU grep answer:

grep -Po ' to=<\K[^>]*' file

\K, which drops everything matched up to that point, is not only syntactically simpler than a look-behind assertion ((?<=...)), but also more flexible - it supports variable-length expressions - and faster (though that may not matter in many real-world situations; if performance is paramount, see below).
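To illustrate the variable-length point: PCRE traditionally rejects variable-length look-behind assertions (newer PCRE2 releases relax this somewhat), whereas \K can follow a match of any length. A contrived sketch - the optional whitespace is invented for illustration and doesn't occur in this maillog:

# OK: \K may follow a variable-length match.
grep -Po 'to=\s*<\K[^>]*' file
# Typically rejected: the look-behind assertion is not fixed-length.
# grep -Po '(?<=to=\s*<)[^>]*' file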


Performance comparison

Here's how the various solutions on this page compare in terms of performance.

Note that this may not matter much in many use cases, but gives insight into:

  • the relative performance of the various standard utilities
  • for a given utility, how tweaking the regex can make a difference.

The absolute values are not important, but the relative performance hopefully provides some insight. See the bottom for the script that produced these numbers, which were obtained on a late-2012 27" iMac running macOS 10.12.3, using a 250,000-line input file created by replicating the sample input from the question, averaging the timings of 10 runs each.
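As a rough sketch of how such an input file can be generated (assuming the two sample lines from the question are saved as sample.log; the file names are illustrative):

# Replicate the 2-line sample 125,000 times -> a 250,000-line test file.
for ((i = 0; i < 125000; i++)); do cat sample.log; done > file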

Mawk                            0.364s
GNU grep, \K, non-backtracking  0.392s
GNU awk                         0.830s
GNU grep, \K                    0.937s
GNU grep, (?<=...)              1.639s
BSD grep + cut                  2.733s
GNU grep + cut                  3.697s
BSD awk                         3.785s
BSD sed, non-backtracking       7.825s
BSD sed                         8.414s
GNU sed                         16.738s
GNU sed, non-backtracking       17.387s

A few conclusions:

  • The specific implementation of a given utility matters.
  • grep is generally a good choice, even if it needs to be combined with cut.
  • Tweaking the regex to avoid backtracking and look-behind assertions can make a difference.
  • GNU sed is surprisingly slow, whereas GNU awk is faster than BSD awk. Strangely, the (partially) non-backtracking solution is slower with GNU sed.

Here's the script that produced the timings above; note that the g-prefixed commands are GNU utilities installed on macOS via Homebrew, as was mawk.

Note that "non-backtracking" only applies partially to some of the commands.

#!/usr/bin/env bash

# Define the test commands.
test01=( 'BSD sed'                        sed -E 's/.*to=<(.*)>.*/\1/' )
test02=( 'BSD sed, non-backtracking'      sed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test03=( 'GNU sed'                        gsed -E 's/.*to=<(.*)>.*/\1/' )
test04=( 'GNU sed, non-backtracking'      gsed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test05=( 'BSD awk'                        awk  -F' to=<|>,' '{print $2}' )
test06=( 'GNU awk'                        gawk -F' to=<|>,' '{print $2}' )
test07=( 'Mawk'                           mawk -F' to=<|>,' '{print $2}' )
#--
test08=( 'GNU grep, (?<=...)'             ggrep -Po '(?<= to=<).*(?=>)' )
test09=( 'GNU grep, \K'                   ggrep -Po ' to=<\K.*(?=>)' )
test10=( 'GNU grep, \K, non-backtracking' ggrep -Po ' to=<\K[^>]*' )
# --
test11=( 'BSD grep + cut'                 "{ grep -o  ' to=<[^>]*' | cut  -d'<' -f2; }" )
test12=( 'GNU grep + cut'                 "{ ggrep -o ' to=<[^>]*' | gcut -d'<' -f2; }" )

# Determine input and output files.
inFile='file'
# NOTE: Do NOT use /dev/null, because GNU grep apparently takes a shortcut
#       when it detects stdout going nowhere, which distorts the timings.
#       Use /dev/tty if you want to see stdout in the terminal (will print
#       as a single block across all tests before the results are reported).
outFile="/tmp/out.$$"
# outFile='/dev/tty'

# Make `time` only report the overall elapsed time.
TIMEFORMAT='%6R'

# How many runs per test whose timings to average.
runs=10

# Read the input file once up front to even the playing field, so that the first command
# doesn't take the hit of being the first to load the file from disk.
echo "Warming up the cache..."
cat "$inFile" >/dev/null

# Run the tests.
echo "Running $(awk '{print NF}' <<<"${!test*}") test(s), averaging the timings of $runs run(s) each; this may take a while..."
{
    for n in ${!test*}; do    
        arrRef="$n[@]"
        test=( "${!arrRef}" )
        # Print test description.
        printf '%s\t' "${test[0]}"
        # Execute test command.
        if (( ${#test[@]} == 2 )); then # single-token command? assume `eval` must be used.
          time for (( n = 0; n < runs; n++ )); do eval "${test[@]: 1}" < "$inFile" >"$outFile"; done
        else # multiple command tokens? assume that they form a simple command that can be invoked directly.
          time for (( n = 0; n < runs; n++ )); do "${test[@]: 1}" "$inFile" >"$outFile"; done
        fi
    done
} 2>&1 | 
  sort -t$'\t' -k2,2n | 
    awk -v runs="$runs" '
      BEGIN{FS=OFS="\t"} { avg = sprintf("%.3f", $2/runs); print $1, avg "s" }
    ' | column -s$'\t' -t
Answered by mklement0