How to "scan" a website (or page) for info, and bring it into my program?

Use an HTML parser like Jsoup. I prefer it over the other HTML parsers available in Java because it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to hassle with the verbose Node and NodeList-like classes of the average Java DOM parser).

Here's a basic kickoff example (just put the latest Jsoup JAR file on the classpath):

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.
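For comparison, here is a minimal sketch of the more verbose NodeList style mentioned above, using only the DOM parser bundled with the JDK. It assumes the input is already well-formed XHTML (the built-in parser rejects typical broken HTML), and the class name, markup, and user names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; not part of the original answer.
class NodeListSketch {

    // Collects the text of every <a> element in a well-formed XHTML string.
    static List<String> linkTexts(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xhtml)));

        NodeList links = document.getElementsByTagName("a");
        List<String> texts = new ArrayList<>();
        // NodeList does not implement Iterable, so an index-based loop is needed.
        for (int i = 0; i < links.getLength(); i++) {
            Node link = links.item(i);
            texts.add(link.getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup standing in for a fetched page.
        String xhtml = "<div id=\"answers\">"
                + "<a href=\"/u/1\">Alice</a>"
                + "<a href=\"/u/2\">Bob</a>"
                + "</div>";
        for (String text : linkTexts(xhtml)) {
            System.out.println("Answerer: " + text);
        }
    }
}
```

Note that there is no selector support here either: filtering by id or class would need extra manual checks or XPath, which is what Jsoup's select() saves you.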

This is referred to as screen scraping; Wikipedia has an article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.

I would use JTidy. It is similar to JSoup, but I don't know JSoup well. JTidy handles broken HTML and returns a W3C Document, so you can use this as a source for XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as its Document model is nicer to work with than the W3C one.

EDIT: A quick look at the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting stuff from the document. This may be a lot easier to work with than getting into XSLT.
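To give a feel for the XSLT route described above, here is a rough sketch using only the JDK's built-in javax.xml.transform API. It assumes the HTML has already been cleaned into well-formed, namespace-free markup (the JTidy step is not shown); the class name, stylesheet, and sample input are illustrative assumptions, not taken from JTidy's documentation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name; not part of the original answer.
class XsltExtractSketch {

    // Emits the text of each <a> below the element with id="answers", one per line.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:for-each select=\"//*[@id='answers']//a\">"
            + "<xsl:value-of select=\".\"/><xsl:text>&#10;</xsl:text>"
            + "</xsl:for-each>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet to well-formed XHTML and returns the extracted text.
    static String extract(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        // With an in-memory W3C Document, a javax.xml.transform.dom.DOMSource
        // could be passed here instead of a StreamSource.
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup standing in for a tidied page.
        String xhtml = "<html><body><div id=\"answers\">"
                + "<a href=\"/u/1\">Alice</a><a href=\"/u/2\">Bob</a>"
                + "</div></body></html>";
        System.out.print(extract(xhtml));
    }
}
```

The XPath expression in the for-each does the job that a CSS selector such as "#answers a" would do in JSoup, which illustrates why the selector approach is usually less ceremony.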

You can use an HTML parser (searching for "java html parser" turns up many useful links).

The process is called 'grabbing website content'. Search for 'grab website content java' for further investigation.

jsoup supports Java 1.5:

https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3

It looks like that stack trace was caused by a bug, which has since been fixed.

Tags: java, html, web-scraping, jsoup
