Extract email addresses from a website using scripts

Tags: bash, email, web

Given a website, what is the best way, programmatically and/or with scripts, to extract all email addresses that appear in plain text (in the form user@domain.com) on each of its pages, and on all pages linked beneath it, recursively or up to some fixed depth?

asked Dec 13 '12 by Open the way

2 Answers

Using shell programming, you can achieve this with two programs piped together:

  • wget: fetches all the pages
  • grep: filters the output, leaving only the email addresses

An example:

wget -q -r -l 5 -O - http://somesite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"

wget, in quiet mode (-q), recursively (-r) fetches all pages from somesite.com up to a maximum depth of 5 (-l 5) and prints everything to stdout (-O -).

grep uses an extended regular expression (-E) and prints only (-o) the matched email addresses.

All emails are going to be printed to standard output and you can write them to a file by appending > somefile.txt to the command.
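
As a small sketch that is not part of the original answer, the full pipeline with the file redirection might look like this; sort -u is an addition here to deduplicate the results, and somefile.txt is just a placeholder name:

# crawl, extract, deduplicate, and save
wget -q -r -l 5 -O - http://somesite.com/ \
  | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
  | sort -u > somefile.txt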

Read the man pages for more documentation on wget and grep.

This example was tested with GNU bash version 4.2.37(1)-release, GNU grep 2.12 and GNU Wget 1.13.4.

answered Sep 21 '22 by roq

First, use wget to recursively download pages from the URL. The -l option is the recursion depth, set to 1 below:

$ mkdir site
$ cd site
$ wget -q -r -l1  http://www.foobar.com

Then run a recursive grep to extract the email addresses. (The regex below is not perfect and may need to be tweaked if you find that not all addresses are being picked up.)

$ grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" *
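
One possible tweak, offered here as a hedged suggestion rather than part of the original answer, is to allow extra local-part characters such as _, %, and +, and to drop the upper bound on the top-level domain length (the 2-4 limit excludes TLDs like .museum):

$ grep -Ehrio "\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b" *

The -i flag already makes the match case-insensitive, so the [a-z] ranges cover uppercase letters as well.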

As an aside, wget does have an option (-O -) to print downloaded content to stdout instead of saving it to disk, but unfortunately it does not work in recursive (-r) mode.
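
If you would rather not leave the downloaded files behind, one workaround (a sketch assuming GNU mktemp is available; not part of the original answer) is to crawl into a temporary directory and remove it when done:

$ tmpdir=$(mktemp -d)        # scratch directory for the crawl
$ wget -q -r -l1 -P "$tmpdir" http://www.foobar.com
$ grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" "$tmpdir"
$ rm -rf "$tmpdir"           # clean up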

answered Sep 19 '22 by dogbane