I'm working on a project that needs a web crawler in Java: it should take a user query about a particular news topic, visit different news websites, extract the news content from those pages, and store it in files or a database. From the stored content I then need to produce an overall summary. I'm new to this field, so I'm hoping for help from people with experience in it.
Right now I have code that extracts news content from a single page, where the page is supplied manually, but I have no idea how to integrate it into a web crawler so it can extract content from many pages.
Can anyone point me to good tutorials or Java implementations that I can use or adapt to my needs?
Have a look at jsoup (http://jsoup.org/), an HTML parser for Java that can fetch a page and extract content with CSS selectors:

// fetch and parse the page, then select the "In the news" headline links
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
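To turn single-page extraction into a crawler, the usual pattern is a frontier queue of URLs plus a visited set: take a URL off the queue, fetch and parse it, extract the content you care about, then add the page's outgoing links back onto the queue. Below is a minimal sketch of that loop using jsoup; the seed URL, the CSS selectors, and the saveArticle method are placeholders you would replace with your own extraction and storage code.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SimpleNewsCrawler {

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();    // URLs still to visit
        Set<String> visited = new HashSet<>();          // URLs already fetched
        frontier.add("http://example-news-site.com/");  // placeholder seed URL

        int maxPages = 50;                              // hard limit so the crawl terminates
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                               // already seen this URL
            }
            try {
                Document doc = Jsoup.connect(url)
                        .userAgent("MyNewsCrawler/0.1")
                        .timeout(10_000)
                        .get();

                // Replace this with your existing single-page extraction logic.
                String title = doc.title();
                String body = doc.select("p").text();   // placeholder selector
                saveArticle(url, title, body);

                // Enqueue the links found on this page for later crawling.
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) {
                        frontier.add(next);
                    }
                }
                Thread.sleep(1000);                     // be polite: roughly one request per second
            } catch (Exception e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }

    // Placeholder: write the extracted article to a file or database.
    private static void saveArticle(String url, String title, String body) {
        System.out.println(url + " -> " + title);
    }
}

In practice you would also restrict the enqueued links to the news sites you care about (e.g. by checking the host name) so the crawl does not wander across the whole web.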
One word of advice in addition to the other answers: make sure your crawler respects robots.txt (i.e. only fetches the paths a site allows) and throttles its requests so it does not hit sites rapidly and indiscriminately, or you are likely to get yourself or your organisation blocked by the sites you want to visit.
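As a rough illustration, here is a very simplified robots.txt check. It only handles Disallow lines under "User-agent: *" and ignores wildcards, Allow rules, and Crawl-delay, so treat it as a starting point rather than a compliant parser; real crawlers typically use a dedicated library for this.

import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class RobotsCheck {

    // Returns true if the path of the given URL is not disallowed for "User-agent: *".
    // Deliberately simplified: no wildcard, Allow, or Crawl-delay support.
    public static boolean isAllowed(String pageUrl) {
        URI uri = URI.create(pageUrl);
        String robotsUrl = uri.getScheme() + "://" + uri.getHost() + "/robots.txt";
        List<String> disallowed = new ArrayList<>();
        boolean appliesToUs = false;

        try (Scanner in = new Scanner(new URL(robotsUrl).openStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.substring(11).trim().equals("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    String rule = line.substring(9).trim();
                    if (!rule.isEmpty()) {
                        disallowed.add(rule);
                    }
                }
            }
        } catch (Exception e) {
            return true; // no robots.txt reachable: assume allowed
        }

        String path = uri.getPath().isEmpty() ? "/" : uri.getPath();
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}

You would call something like RobotsCheck.isAllowed(url) before fetching each page in the crawl loop, and cache the parsed rules per host so you do not re-download robots.txt for every request.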