I want to read the text from a web page. I don't want to get the web page's HTML code. I found this code:
try {
// Create a URL for the desired page
URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");
// Read all the text returned by the server
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str;
while ((str = in.readLine()) != null) {
str = in.readLine().toString();
System.out.println(str);
// str is one line of text; readLine() strips the newline character(s)
}
in.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
but this code gives me the HTML code of the web page. I want to get the whole text inside this page. How can I do this with Java?
Java has built-in tools and third-party libraries for reading/downloading web pages. In the examples, we use HttpClient, URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit. In the following examples, we download HTML source from the webcode.me tiny web page.
You may want to have a look at jsoup for this:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
String text = doc.body().text(); // "An example link"
This example is an extract from one on their site.
Use JSoup.
You will be able to parse the content using css style selectors.
In this example you can try
Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
String textContents = doc.select(".newsText").first().text();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With