Java - Searching For Data within a Website

Tags:

java

I'm new to Java and having some problems.

The main idea is to connect to a website, collect information from it, and store it in an array.

What I want the program to do is search the website, find a keyword, and store what comes after the keyword.

On the front page of DaniWeb, along the bottom of the page, there is a section called "Tag Cloud" which is filled with tags (short words):

Tag Cloud: "I want to store what is written here"

My idea is to first read in the HTML of the website, then search that file for the keyword followed by the text using Scanner and StringTokenizer, and store the results in an array.

Is there a better or easier way?

Where do you suggest I look for some examples?

Here is what I have so far:

import java.io.*;
import java.net.*;
import java.util.*;

public class URLReader {

    public static void main(String[] args) throws Exception {

        URL dweb = new URL("http://www.daniweb.com/");
        URLConnection dw = dweb.openConnection();
        // The connection variable is dw (the original referred to an undefined hc)
        BufferedReader in = new BufferedReader(new InputStreamReader(dw.getInputStream()));
        System.out.println("connected to daniweb");
        String inputLine;

        PrintStream out = new PrintStream(new FileOutputStream("OutFile.txt"));

        try {
            // Copy the page to OutFile.txt line by line
            while ((inputLine = in.readLine()) != null) {
                out.println(inputLine);
            }
            System.out.println("printed text to outfile");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            in.close();
            out.close();
        }

        try {
            // Scanner needs a File; a bare OutFile.txt is not a valid expression
            Scanner scan = new Scanner(new File("OutFile.txt"));
            // txtSearch was an undefined GUI field; take the search word from the command line instead
            String search = args.length > 0 ? args[0] : "Cloud:";
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                //still working
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    // Compare strings with equals(); == only compares references
                    if (word.equals(search)) {

                    } else {

                    }
                }
            }
            scan.close();
        } catch (IOException iox) {
            iox.printStackTrace();
        }
    }
}

Any help at all would be very much appreciated!

AdianDes asked Aug 25 '10 12:08




2 Answers

I recommend jsoup. It will retrieve and parse the page for you.

On daniweb, each tag cloud link has the CSS class tagcloudlink. So you just need to tell jsoup to extract all text in tags that have the class tagcloudlink.

This is off the top of my head plus some help from the jsoup site; I haven't tested it but it should get you started:

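import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

List<String> tags = new ArrayList<String>();
Document doc = Jsoup.connect("http://daniweb.com/").get();
Elements taglinks = doc.select("a.tagcloudlink");
for (Element link : taglinks) {
    tags.add(link.text());
}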
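
If adding a library is not an option, the same selection can be roughly approximated with java.util.regex from the standard library. This is only a sketch: it assumes each tag link looks like <a ... class="tagcloudlink" ...>text</a> with no nested tags inside the link, and the class name TagCloudRegex, the method extractTags, and the sample markup are all made up for illustration. Regular expressions are fragile on real-world HTML, so a proper parser is still the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCloudRegex {

    // Assumed markup: <a ... class="tagcloudlink" ...>tag text</a>,
    // with no nested tags inside the link.
    private static final Pattern TAG_LINK = Pattern.compile(
            "<a\\b[^>]*\\bclass\\s*=\\s*\"[^\"]*\\btagcloudlink\\b[^\"]*\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed link text of every anchor whose class list contains tagcloudlink.
    public static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<String>();
        Matcher m = TAG_LINK.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not copied from the real page.
        String html = "<div>Tag Cloud: "
                + "<a class=\"tagcloudlink\" href=\"/tags/java\">java</a> "
                + "<a href=\"/about\">about</a> "
                + "<a href=\"/tags/php\" class=\"tagcloudlink big\"> php </a></div>";
        System.out.println(extractTags(html)); // prints [java, php]
    }
}
```

To run it against the downloaded page, read OutFile.txt into a single String and pass that to extractTags instead of the sample markup.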
Jeff answered Oct 12 '22 23:10


You could use HTML Parser for this. Another one I've used a lot and like is Jericho HTML Parser.

Corv1nus answered Oct 13 '22 00:10