I want to use Jsoup to extract all of the email addresses and URLs from a website and store them in a HashSet (so there are no repeats). I am attempting it, but I am not sure what I need to pass to select, or whether I am doing it right. Here is the code:
Document doc = Jsoup.connect(link).get();
Elements URLS = doc.select("");
Elements emails = doc.select("");
emailSet.add(emails.toString());
linksToVisit.add(URLS.toString());
You can do it like this:
Fetch the HTML document:
Document doc = Jsoup.connect(link).get();
Extract emails into a HashSet, using a regex to extract all the email addresses on the page:
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
    emails.add(matcher.group());
}
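One thing to watch for: the last character class in that pattern allows a dot, so an address at the end of a sentence ("write to bob@example.com.") is matched with the trailing period attached. Below is a minimal sketch of one way to handle that, using only the standard library so it can be run without Jsoup. The class name EmailExtractor and the method extract are hypothetical names chosen for illustration; the pattern is the same rough heuristic as above, not a full address validator.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: pulls email-like strings out of plain text.
class EmailExtractor {
    // Same heuristic pattern as in the answer; it does not validate addresses.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");

    static Set<String> extract(String text) {
        // A set drops duplicates; LinkedHashSet also keeps first-seen order.
        Set<String> emails = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            // Strip trailing dots or hyphens picked up from surrounding punctuation.
            String candidate = matcher.group().replaceAll("[.-]+$", "");
            emails.add(candidate);
        }
        return emails;
    }
}
```

With Jsoup you would call it as EmailExtractor.extract(doc.text()), where doc is the Document fetched earlier.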
Extract links:
Set<String> links = new HashSet<String>();
Elements elements = doc.select("a[href]");
for (Element e : elements) {
    links.add(e.attr("href"));
}
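Note that attr("href") returns the attribute exactly as written in the page, so relative links such as "/about" or "page2.html" stay relative. If the set is meant to be a list of pages to visit next, you probably want absolute URLs. Jsoup can do this for you with e.absUrl("href"). The sketch below shows the same idea with only java.net.URI, in case you want to see what the resolution step does. LinkResolver and resolveAll are hypothetical names for illustration, and skipping "#", "mailto:" and "javascript:" links is an assumption about what a crawler would want.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: turns raw href values into absolute URLs, without duplicates.
class LinkResolver {
    static Set<String> resolveAll(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        Set<String> resolved = new LinkedHashSet<>();
        for (String href : hrefs) {
            String h = href.trim();
            // Assumption: in-page anchors and non-HTTP schemes are not worth visiting.
            if (h.isEmpty() || h.startsWith("#")
                    || h.startsWith("mailto:") || h.startsWith("javascript:")) {
                continue;
            }
            try {
                // Resolves relative references against the page URL.
                resolved.add(base.resolve(h).toString());
            } catch (IllegalArgumentException ex) {
                // Malformed href (for example, containing spaces): skip it.
            }
        }
        return resolved;
    }
}
```

For example, with the base URL https://example.com/docs/index.html, the href "page2.html" resolves to https://example.com/docs/page2.html and "/about" resolves to https://example.com/about.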
Complete and working code here: https://gist.github.com/JonasCz/a3b81def26ecc047ceb5
Now don't become a spammer!
This is my working solution. It searches for emails not only in the visible text, but also in the HTML source of the page body:
public Set<String> getEmailsByUrl(String url) {
    Document doc;
    Set<String> emailSet = new HashSet<>();
    try {
        doc = Jsoup.connect(url)
                .userAgent("Mozilla")
                .get();
        Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
        Matcher matcher = p.matcher(doc.body().html());
        while (matcher.find()) {
            emailSet.add(matcher.group());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return emailSet;
}