
Using JSoup to scrape emails and links

I want to use JSoup to extract all of the email addresses and URLs on a website and store them in a HashSet (so there are no repeats). I have made an attempt, but I am not sure what I need to pass to select, or whether I am doing it right. Here is the code:

Document doc = Jsoup.connect(link).get();

Elements URLS = doc.select("");
Elements emails = doc.select("");
emailSet.add(emails.toString());
linksToVisit.add(URLS.toString());
Jonathan asked Sep 21 '25 10:09



2 Answers

Do it like this:


Fetch the HTML document:

Document doc = Jsoup.connect(link).get();

Extract emails into a HashSet, using a regex to extract all the email addresses on the page:

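// Simple pattern that matches most common address forms (not a full RFC 5322 validator)
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
// doc.text() returns only the visible text of the page, without tags or attributes
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
    emails.add(matcher.group());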
}
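
One caveat with the pattern above: its last character class includes '.', so an address at the end of a sentence ("bob@example.com.") is captured with the trailing dot and ends up in the set as a separate entry. Here is a minimal sketch of one way to handle that, using only java.util.regex on a plain string. The EmailExtractor class and the extractEmails helper are hypothetical names for illustration, not part of jsoup or the linked gist:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractor {
    // Same pattern as in the answer above.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");

    // Hypothetical helper: collects unique matches and strips any trailing
    // '.' or '-' that the greedy last character class picked up.
    static Set<String> extractEmails(String text) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            emails.add(matcher.group().replaceAll("[.-]+$", ""));
        }
        return emails;
    }

    public static void main(String[] args) {
        // Made-up sample text; in the scraper this would be doc.text().
        String text = "Write to bob@example.com. Or alice@mail.example.org, "
                + "or bob@example.com again.";
        System.out.println(extractEmails(text));
    }
}
```

With the sample text this prints [bob@example.com, alice@mail.example.org]. Without the trimming step, "bob@example.com." and "bob@example.com" would both be stored.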

Extract links:

Set<String> links = new HashSet<String>();

// "a[href]" selects every <a> element that has an href attribute
Elements elements = doc.select("a[href]");
for (Element e : elements) {
    links.add(e.attr("href"));
}
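
Note that href values are often relative ("/about", "page2.html"), which is not much use in a set of links to visit later. jsoup can return absolute URLs itself via e.absUrl("href") or e.attr("abs:href") when the document was fetched with Jsoup.connect(...).get(). The sketch below shows the same idea with only the standard library, resolving each href against the page URL and skipping non-HTTP links such as mailto:. LinkResolver and resolveLinks are hypothetical names, and the URLs are made-up examples:

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkResolver {
    // Hypothetical helper: resolves raw href values against the page URL
    // and keeps only http/https results, without duplicates.
    static Set<String> resolveLinks(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        Set<String> links = new LinkedHashSet<>();
        for (String href : hrefs) {
            try {
                URI resolved = base.resolve(href.trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip href values that are not syntactically valid URIs.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
                "/about", "page2.html", "https://other.example.org/",
                "mailto:bob@example.com", "/about");
        System.out.println(
                resolveLinks("https://example.com/docs/index.html", hrefs));
    }
}
```

With these inputs the result is [https://example.com/about, https://example.com/docs/page2.html, https://other.example.org/]: the relative links are resolved, the repeated "/about" is stored once, and the mailto: link is left out.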

Complete and working code here: https://gist.github.com/JonasCz/a3b81def26ecc047ceb5

Now don't become a spammer!

JonasCz answered Sep 23 '25 13:09



This is my working solution. It searches for emails not only in the visible text but also in the HTML markup:

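import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public Set<String> getEmailsByUrl(String url) {
    Set<String> emailSet = new HashSet<>();

    try {
        // Some sites reject requests that have no browser-like user agent
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla")
                .get();

        Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
        // Match against the body's HTML so addresses inside attributes
        // (for example mailto: links) are found as well
        Matcher matcher = p.matcher(doc.body().html());
        while (matcher.find()) {
            emailSet.add(matcher.group());
        }
    } catch (IOException e) {
        // On a network or HTTP error the set is returned empty
        e.printStackTrace();
    }

    return emailSet;
}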
excluzzzive answered Sep 23 '25 13:09
