
How to read a text from a web page with Java?

Tags: java

I want to read the visible text of a web page, not its HTML source. I found this code:

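    // Needs java.io.BufferedReader, java.io.InputStreamReader, java.io.IOException,
    // java.net.URL and java.net.MalformedURLException
    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s).
            // Do not call readLine() again here: that would skip every other line.
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
        e.printStackTrace(); // the URL string could not be parsed
    } catch (IOException e) {
        e.printStackTrace(); // connecting to or reading from the server failed
    }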

But this code prints the page's HTML source. I want only the text content of the page. How can I do this in Java?

Rigor Mortis asked Mar 22 '12 15:03




2 Answers

You may want to have a look at jsoup for this:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is taken from the jsoup website.
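If adding jsoup as a dependency is not an option, a rougher version of the same idea can be sketched with the HTML parser that ships in the JDK's Swing packages (`javax.swing.text.html`). This is a different technique from jsoup, not part of it, and the names `HtmlToText` and `extractText` are made up for illustration. The Swing parser targets roughly HTML 3.2 and treats whitespace differently from jsoup, so the result should be taken as approximate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes of an HTML string using
// only classes from the JDK (module java.desktop).
class HtmlToText {

    static String extractText(String html) {
        StringBuilder sb = new StringBuilder();

        // The parser calls handleText once for each run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate runs with a space so adjacent words do not fuse.
                sb.append(data).append(' ');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
        System.out.println(extractText(html));
    }
}
```

For anything beyond simple pages jsoup is still the more comfortable choice, since it copes better with modern markup and offers selectors.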

Fabian Barney answered Sep 17 '22 21:09


Use jsoup.

It lets you pick out parts of the page with CSS-style selectors.

For this page you can try:

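    Document doc = Jsoup.connect("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history").get();
    // first() returns null when nothing matches the selector, so check before calling text()
    Element news = doc.select(".newsText").first();
    String textContents = (news != null) ? news.text() : "";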

Nitzan Volman answered Sep 19 '22 21:09