I want to parse a simple web site and scrape information from it.
I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for an HTML file, but it always gets into an infinite loop.
    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();
    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;
    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);
    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }
    in.close();
    out.close();
    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);
    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?
There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:
    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
    Elements body = doc.select("body");
Or if you want all links:
    Elements links = doc.select("body a");
You no longer need to open connections or handle streams yourself. If you have ever used jQuery, the selector syntax will feel very similar.
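As for why the original code seems to hang: I can't see the page's markup, so this is a guess, but DocumentBuilder expects well-formed XML, and when a page declares an XHTML DOCTYPE the parser may try to download the DTD from w3.org, which can be very slow. If you want to stay with the DOM API and the page really is well-formed XHTML, a rough sketch like the following may help. The class name `XhtmlDomSketch` and the helper `countTags` are made up for illustration, and the feature URI is specific to the Xerces parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlDomSketch {

    // Hypothetical helper: counts the elements with the given tag name
    // in a string that must already be well-formed XHTML.
    static int countTags(String xhtml, String tag) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Xerces-specific feature (assumes the JDK's bundled parser):
        // do not fetch the external DTD named in the DOCTYPE.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        return doc.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        // Small inline page so the sketch runs without touching the network.
        String page =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html><head><title>t</title></head>"
            + "<body><a href=\"/a\">A</a><a href=\"/b\">B</a></body></html>";
        System.out.println(countTags(page, "body"));
        System.out.println(countTags(page, "a"));
    }
}
```

This only works for markup that is valid XML. Most real pages have unclosed tags or undeclared entities and will make the parser throw, which is why an HTML-aware parser is the more practical choice.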
Definitely JSoup is the answer. ;-)