
How do I get the source of a given URL from a servlet?

I want to read the source code (HTML) of a given URL from my servlet.

For example, if the URL is http://www.google.com, my servlet needs to read that page's HTML source. I need this because my web application will read other web pages, extract useful content, and do something with it.

Let's say my application shows a list of shops in one category in a city. To generate that list, my web application (servlet) reads a given web page that displays various shops, filters the source for the useful details, and finally builds the list (because my servlet has no access to the database of the web application behind that URL).

Does anyone know a solution? (I specifically need to do this in a servlet.) If you think there is a better way to get details from another site, please let me know.

Thank you

Débora asked Dec 03 '22 07:12



2 Answers

You don't need anything servlet-specific to read data from a remote server. You can use the java.net.URL or java.net.URLConnection class to read remote content from an HTTP server, and call it from your servlet like any other Java code. For example,

// openStream() returns the response body directly, so no cast from getContent() is needed
InputStream input = new URL("http://www.google.com").openStream();
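To turn that stream into the HTML text the question asks for, the bytes still have to be read and decoded. The following is a minimal sketch using only the standard library, not a definitive implementation. The class name `PageSource`, the helper names `readAll` and `fetch`, the 5-second timeouts, and the UTF-8 charset are all assumptions made for illustration; a real page may declare a different encoding in its Content-Type header.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative, not from the answer.
class PageSource {

    // Reads every byte from the stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Opens the URL and returns the response body as text.
    // Assumption: the page is UTF-8 encoded.
    static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setConnectTimeout(5000); // assumed timeout, in milliseconds
        connection.setReadTimeout(5000);    // assumed timeout, in milliseconds
        try (InputStream in = connection.getInputStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Prints the raw HTML of the page used in the question.
        System.out.println(fetch("http://www.google.com"));
    }
}
```

Inside a servlet, `PageSource.fetch(...)` could be called from `doGet` like any other method. Setting timeouts is a design choice here: without them, a slow remote site can hold the servlet's request thread indefinitely.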
Andrey Adamovich answered Dec 21 '22 11:12



Take a look at jsoup for fetching and parsing the HTML.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Jeremy answered Dec 21 '22 11:12
