I'm working on a project that needs a web crawler in Java: it should take a user query about a particular news topic, visit different news websites, extract the news content from those pages, and store it in files or a database. From the stored content I then need to produce an overall summary. I'm new to this field, so I'm hoping for help from people with experience in it.
Right now I have code that extracts news content from a single page, where the page is supplied manually, but I have no idea how to integrate it into a web crawler so it can extract content from many pages.
Can anyone point me to good tutorials or Java implementations that I can use or adapt to my needs?
Have a look at jsoup (http://jsoup.org/), an HTML parser for Java that can fetch a page and extract content with CSS selectors:

// fetch and parse the page, then select the "In the news" headline links
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
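To turn single-page extraction into a crawler, the usual pattern is a frontier queue of URLs plus a visited set: take a URL off the queue, fetch and parse it, extract the content you care about, then add the page's outgoing links back onto the queue. Below is a minimal sketch of that loop using jsoup; the seed URL, the CSS selectors, and the saveArticle method are placeholders you would replace with your own extraction and storage code.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SimpleNewsCrawler {

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();    // URLs still to visit
        Set<String> visited = new HashSet<>();          // URLs already fetched
        frontier.add("http://example-news-site.com/");  // placeholder seed URL

        int maxPages = 50;                              // hard limit so the crawl terminates
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                               // already seen this URL
            }
            try {
                Document doc = Jsoup.connect(url)
                        .userAgent("MyNewsCrawler/0.1")
                        .timeout(10_000)
                        .get();

                // Replace this with your existing single-page extraction logic.
                String title = doc.title();
                String body = doc.select("p").text();   // placeholder selector
                saveArticle(url, title, body);

                // Enqueue the links found on this page for later crawling.
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (!next.isEmpty() && !visited.contains(next)) {
                        frontier.add(next);
                    }
                }
                Thread.sleep(1000);                     // be polite: roughly one request per second
            } catch (Exception e) {
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }

    // Placeholder: write the extracted article to a file or database.
    private static void saveArticle(String url, String title, String body) {
        System.out.println(url + " -> " + title);
    }
}

In practice you would also restrict the enqueued links to the news sites you care about (e.g. by checking the host name) so the crawl does not wander across the whole web.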
One word of advice in addition to the other answers: make sure your crawler respects robots.txt (i.e. only fetches the paths a site allows) and throttles its requests so it does not hit sites rapidly and indiscriminately, or you are likely to get yourself or your organisation blocked by the sites you want to visit.
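As a rough illustration, here is a very simplified robots.txt check. It only handles Disallow lines under "User-agent: *" and ignores wildcards, Allow rules, and Crawl-delay, so treat it as a starting point rather than a compliant parser; real crawlers typically use a dedicated library for this.

import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class RobotsCheck {

    // Returns true if the path of the given URL is not disallowed for "User-agent: *".
    // Deliberately simplified: no wildcard, Allow, or Crawl-delay support.
    public static boolean isAllowed(String pageUrl) {
        URI uri = URI.create(pageUrl);
        String robotsUrl = uri.getScheme() + "://" + uri.getHost() + "/robots.txt";
        List<String> disallowed = new ArrayList<>();
        boolean appliesToUs = false;

        try (Scanner in = new Scanner(new URL(robotsUrl).openStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.substring(11).trim().equals("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    String rule = line.substring(9).trim();
                    if (!rule.isEmpty()) {
                        disallowed.add(rule);
                    }
                }
            }
        } catch (Exception e) {
            return true; // no robots.txt reachable: assume allowed
        }

        String path = uri.getPath().isEmpty() ? "/" : uri.getPath();
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}

You would call something like RobotsCheck.isAllowed(url) before fetching each page in the crawl loop, and cache the parsed rules per host so you do not re-download robots.txt for every request.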