
How to design a web crawler in Java?

I'm working on a project that requires a web crawler in Java. It should take a user query about a particular news subject, visit different news websites, extract the news content from those pages, and store it in files or a database. I need this to produce a summary of all the stored content. I'm new to this field, so I'd appreciate help from anyone who has experience with this.

Right now I have code that extracts news content from a single page, but the page has to be supplied manually. I have no idea how to integrate it into a web crawler so that it extracts content from many different pages.

Can anyone point me to good tutorials or Java implementations that I can use or adapt to my needs?

dark_shadow asked Dec 05 '22 17:12


2 Answers

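Have a look at jsoup (http://jsoup.org/), a Java library for fetching and parsing HTML. For example, this fetches a page and picks out elements with a CSS selector: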

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
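
To turn single-page extraction into a crawler, one common approach is a breadth-first loop over a queue of URLs with a set of already-seen URLs. The following is only a minimal sketch of that idea, not a complete crawler. The class name `SimpleCrawler` and the parameters `linkExtractor` and `maxPages` are hypothetical names chosen for illustration. The link extractor is passed in as a function so the loop stays independent of how pages are fetched; with jsoup it could fetch a page with `Jsoup.connect(url).get()`, run `doc.select("a[href]")`, and call `absUrl("href")` on each element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper (not part of jsoup): a breadth-first crawl loop.
class SimpleCrawler {
    // Given a URL, returns the URLs that page links to (assumed to be supplied by the caller).
    private final Function<String, List<String>> linkExtractor;
    // Upper bound on the number of pages visited, so the crawl always terminates.
    private final int maxPages;

    SimpleCrawler(Function<String, List<String>> linkExtractor, int maxPages) {
        this.linkExtractor = linkExtractor;
        this.maxPages = maxPages;
    }

    // Visits pages breadth-first from the seed and returns them in visit order.
    List<String> crawl(String seedUrl) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        List<String> visited = new ArrayList<>();

        frontier.add(seedUrl);
        seen.add(seedUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url); // this is where the page's news content would be extracted and stored

            for (String link : linkExtractor.apply(url)) {
                // Set.add returns false for URLs seen before, so each URL is queued once.
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

The existing single-page extraction code would be called at the point where a URL is added to `visited`. A real crawler would probably also filter links by domain and handle pages that fail to load.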
Juan Mendes answered Dec 19 '22 01:12


One word of advice in addition to the other answers: make sure that your crawler respects robots.txt and does not crawl sites rapidly and indiscriminately, or you are likely to get yourself or your organisation blocked by the sites you want to visit.
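
As a rough illustration of what respecting robots.txt can look like, here is a deliberately simplified sketch. The class `RobotsRules` is a hypothetical name, not a library API. It only reads `Disallow` lines in groups addressed to `User-agent: *` and treats each value as a plain path prefix; it ignores `Allow` lines, wildcards, and `Crawl-delay`, which a production crawler would need to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: a simplified robots.txt check for the "*" user agent only.
class RobotsRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    // robotsTxt is the text of a site's /robots.txt file, already downloaded by the caller.
    RobotsRules(String robotsTxt) {
        boolean groupMatches = false; // does the current group apply to "*"?
        boolean lastWasAgent = false; // was the previous directive a User-agent line?

        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine.split("#", 2)[0].trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or not a "field: value" directive
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    groupMatches = false; // a new group starts here
                }
                if (value.equals("*")) {
                    groupMatches = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means nothing is disallowed.
                if (field.equals("disallow") && groupMatches && !value.isEmpty()) {
                    disallowedPrefixes.add(value);
                }
            }
        }
    }

    // Returns false if the path starts with any disallowed prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The crawler would call `isAllowed` before fetching each path, and could pause between requests to the same site (for example with `Thread.sleep`) so that it does not hit the site rapidly.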

DNA answered Dec 19 '22 01:12