Extract email addresses from log with grep or sed

Tags: regex, grep, sed, awk, cut

Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<[email protected]>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25, delay=5.4, delays=0.02/3.2/0.97/1.1, dsn=5.0.0, status=bounced (host mta5.am0.yahoodns.net[98.138.112.35] said: 554 delivery error: dd This user doesn't have a yahoo.com account ([email protected]) [0] - mta1321.mail.ne1.yahoo.com (in reply to end of DATA command))
Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<[email protected]>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25, delay=5.9, delays=0.01/3.1/0.99/1.8, dsn=5.0.0, status=bounced (host mta7.am0.yahoodns.net[98.136.217.202] said: 554 delivery error: dd This user doesn't have a yahoo.com account ([email protected]) [0] - mta1397.mail.gq1.yahoo.com (in reply to end of DATA command))

From the maillog above I would like to extract the email addresses enclosed in angle brackets < ... >, e.g. to=<[email protected]> becomes [email protected]

I am using cut -d' ' -f7 to extract the emails, but I am curious whether there is a more flexible way.
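For comparison, here is what the cut-based approach looks like once the to=< ... > wrapper is also stripped (a sketch; it relies on the address always being the 7th space-separated field, which is exactly what makes it fragile):

# Field 7 is "to=<...>,"; a second step must trim the wrapper.
cut -d' ' -f7 file | sed 's/^to=<//; s/>,\{0,1\}$//'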

Asked Jan 26 '17 by sherpaurgen




3 Answers

With GNU grep, just use a regular expression containing a lookbehind and a lookahead:

$ grep -Po '(?<=to=<).*(?=>)' file
[email protected]
[email protected]

This says: hey, extract all the strings preceded by to=< and followed by >.
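Note that -P is a GNU grep feature (available only in builds compiled with PCRE support); stock BSD/macOS grep doesn't have it. As a portable fallback, here is a sketch of the grep + cut pipeline that also shows up in the benchmark further below:

# Isolate the "to=<address" token, then cut off everything up to the "<".
grep -o 'to=<[^>]*' file | cut -d'<' -f2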

Answered by fedorqui 'SO stop harming'


You can use awk like this:

awk -F'to=<|>,' '{print $2}' the.log

I'm splitting the line by to=< or >, and printing the second field.
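If the log also contains lines without a to=< ... > token, the split yields only one field there and $2 prints as an empty line; a guard on the field count (a sketch) avoids that:

# Print the address only on lines that actually split into multiple fields.
awk -F'to=<|>,' 'NF > 1 {print $2}' the.log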

Answered by hek2mgl


Just to show a sed alternative (requires GNU or BSD/macOS sed due to -E):

sed -E 's/.* to=<(.*)>.*/\1/' file

Note how the regex must match the entire line so that the substitution of the capture-group match (the email address) yields only that match.

A slightly more efficient - but perhaps less readable - variation is
sed -E 's/.* to=<([^>]*).*/\1/' file


A POSIX-compliant formulation is a little more cumbersome due to the legacy syntax required by BREs (basic regular expressions):

sed 's/.* to=<\(.*\)>.*/\1/' file
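One caveat that applies to all of these sed variants: lines that don't contain to=< pass through unmodified, because the substitution simply fails to match. If that's a concern, a common refinement (sketched here on the -E form) is to suppress default output and print only when a substitution occurred:

sed -nE 's/.* to=<([^>]*).*/\1/p' file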

A variation of fedorqui's helpful GNU grep answer:

grep -Po ' to=<\K[^>]*' file

\K, which drops everything matched up to that point, is not only syntactically simpler than a look-behind assertion ((?<=...)), but also more flexible - it supports variable-length expressions - and faster (though that may not matter in many real-world situations; if performance is paramount, see below).
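To illustrate the variable-length point: PCRE traditionally rejects variable-length look-behind assertions (newer PCRE2 releases relax this somewhat), whereas \K can follow a match of any length. A contrived sketch - the optional whitespace is invented for illustration and doesn't occur in this maillog:

# OK: \K may follow a variable-length match.
grep -Po 'to=\s*<\K[^>]*' file
# Typically rejected: the look-behind assertion is not fixed-length.
# grep -Po '(?<=to=\s*<)[^>]*' file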


Performance comparison

Here's how the various solutions on this page compare in terms of performance.

Note that this may not matter much in many use cases, but gives insight into:

  • the relative performance of the various standard utilities
  • for a given utility, how tweaking the regex can make a difference.

The absolute values are not important, but the relative performance hopefully provides some insight. See the bottom for the script that produced these numbers, which were obtained on a late-2012 27" iMac running macOS 10.12.3, using a 250,000-line input file created by replicating the sample input from the question, averaging the timings of 10 runs each.
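As a rough sketch of how such an input file can be generated (assuming the two sample lines from the question are saved as sample.log; the file names are illustrative):

# Replicate the 2-line sample 125,000 times -> a 250,000-line test file.
for ((i = 0; i < 125000; i++)); do cat sample.log; done > file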

Mawk                            0.364s
GNU grep, \K, non-backtracking  0.392s
GNU awk                         0.830s
GNU grep, \K                    0.937s
GNU grep, (?<=...)              1.639s
BSD grep + cut                  2.733s
GNU grep + cut                  3.697s
BSD awk                         3.785s
BSD sed, non-backtracking       7.825s
BSD sed                         8.414s
GNU sed                         16.738s
GNU sed, non-backtracking       17.387s

A few conclusions:

  • The specific implementation of a given utility matters.
  • grep is generally a good choice, even if it needs to be combined with cut.
  • Tweaking the regex to avoid backtracking and look-behind assertions can make a difference.
  • GNU sed is surprisingly slow, whereas GNU awk is faster than BSD awk. Strangely, the (partially) non-backtracking solution is slower with GNU sed.

Here's the script that produced the timings above; note that the g-prefixed commands are GNU utilities installed on macOS via Homebrew, as was mawk.

Note that "non-backtracking" only applies partially to some of the commands.

#!/usr/bin/env bash

# Define the test commands.
test01=( 'BSD sed'                        sed -E 's/.*to=<(.*)>.*/\1/' )
test02=( 'BSD sed, non-backtracking'      sed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test03=( 'GNU sed'                        gsed -E 's/.*to=<(.*)>.*/\1/' )
test04=( 'GNU sed, non-backtracking'      gsed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test05=( 'BSD awk'                        awk  -F' to=<|>,' '{print $2}' )
test06=( 'GNU awk'                        gawk -F' to=<|>,' '{print $2}' )
test07=( 'Mawk'                           mawk -F' to=<|>,' '{print $2}' )
#--
test08=( 'GNU grep, (?<=...)'             ggrep -Po '(?<= to=<).*(?=>)' )
test09=( 'GNU grep, \K'                   ggrep -Po ' to=<\K.*(?=>)' )
test10=( 'GNU grep, \K, non-backtracking' ggrep -Po ' to=<\K[^>]*' )
# --
test11=( 'BSD grep + cut'                 "{ grep -o  ' to=<[^>]*' | cut  -d'<' -f2; }" )
test12=( 'GNU grep + cut'                 "{ ggrep -o ' to=<[^>]*' | gcut -d'<' -f2; }" )

# Determine input and output files.
inFile='file'
# NOTE: Do NOT use /dev/null, because GNU grep apparently takes a shortcut
#       when it detects stdout going nowhere, which distorts the timings.
#       Use /dev/tty if you want to see stdout in the terminal (will print
#       as a single block across all tests before the results are reported).
outFile="/tmp/out.$$"
# outFile='/dev/tty'

# Make `time` only report the overall elapsed time.
TIMEFORMAT='%6R'

# How many runs per test whose timings to average.
runs=10

# Read the input file once up front to even the playing field, so that the first command
# doesn't take the hit of being the first to load the file from disk.
echo "Warming up the cache..."
cat "$inFile" >/dev/null

# Run the tests.
echo "Running $(awk '{print NF}' <<<"${!test*}") test(s), averaging the timings of $runs run(s) each; this may take a while..."
{
    for n in ${!test*}; do    
        arrRef="$n[@]"
        test=( "${!arrRef}" )
        # Print test description.
        printf '%s\t' "${test[0]}"
        # Execute test command.
        if (( ${#test[@]} == 2 )); then # single-token command? assume `eval` must be used.
          time for (( n = 0; n < runs; n++ )); do eval "${test[@]: 1}" < "$inFile" >"$outFile"; done
        else # multiple command tokens? assume that they form a simple command that can be invoked directly.
          time for (( n = 0; n < runs; n++ )); do "${test[@]: 1}" "$inFile" >"$outFile"; done
        fi
    done
} 2>&1 | 
  sort -t$'\t' -k2,2n | 
    awk -v runs="$runs" '
      BEGIN{FS=OFS="\t"} { avg = sprintf("%.3f", $2/runs); print $1, avg "s" }
    ' | column -s$'\t' -t
Answered by mklement0