Given a website, what is the best procedure, programmatically and/or using scripts, to extract all email addresses that appear in plain text in the form user@domain.com on each page of that site and on all pages underneath it, recursively or up to some fixed depth?
Using shell programming, you can achieve your goal with two programs piped together.
An example:
wget -q -r -l 5 -O - http://somesite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"
wget, in quiet mode (-q), downloads all pages recursively (-r) to a maximum depth of 5 (-l 5) from somesite.com and prints everything to stdout (-O -).
grep uses an extended regular expression (-E) and prints only (-o) the matching email addresses.
All emails will be printed to standard output; you can write them to a file by appending > somefile.txt to the command.
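For instance, a complete pipeline that also de-duplicates the results might look like this (a sketch only; sort -u and the somefile.txt name are additions to the original command):
wget -q -r -l 5 -O - http://somesite.com/ \
  | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
  | sort -u > somefile.txt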
Read the man pages for more documentation on wget and grep.
This example was tested with GNU bash version 4.2.37(1)-release, GNU grep 2.12 and GNU Wget 1.13.4.
First, use wget to recursively download pages from the URL. The -l option is the recursion depth, set to 1 below:
$ mkdir site
$ cd site
$ wget -q -r -l1 http://www.foobar.com
Then run a recursive grep to extract the email addresses. (The regex below is not perfect and may need to be tweaked if you find that not all addresses are being picked up.)
$ grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" *
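If the output needs further cleanup, one option (a sketch; the extension list and the emails.txt name are only illustrative) is to de-duplicate the matches and drop obvious false positives such as asset names like logo@2x.png, which also fit the pattern:
$ grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" * \
    | grep -viE '\.(png|jpe?g|gif|css|js)$' \
    | sort -u > emails.txt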
As an aside, wget does have an option (-O -) to print downloaded content to stdout instead of saving it to disk, but unfortunately it does not work in recursive (-r) mode.
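If a single pipeline is still preferred despite that limitation, one workaround (a sketch; /tmp/site is just an arbitrary scratch directory) is to let wget mirror the site to disk and then stream every downloaded file through grep:
$ wget -q -r -l1 -P /tmp/site http://www.foobar.com
$ find /tmp/site -type f -exec cat {} + \
    | grep -Eio "\b[a-z0-9.-]+@[a-z0-9.-]+\.[a-z]{2,4}\b" \
    | sort -u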