import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Parser {

    public static void main(String[] args) {
        Parser p = new Parser();
        p.matchString();
    }

    parserObject courseObject = new parserObject();
    ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
    ArrayList<String> courseNames = new ArrayList<String>();
    String theWebPage = " ";

    {
        try {
            URL theUrl = new URL("http://ocw.mit.edu/courses/");
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(theUrl.openStream()));
            String str = null;
            while ((str = reader.readLine()) != null) {
                theWebPage = theWebPage + " " + str;
            }
            reader.close();
        } catch (MalformedURLException e) {
            // do nothing
        } catch (IOException e) {
            // do nothing
        }
    }

    public void matchString() {
        // this is the regex I am using to compare strings on the input page
        String matchRegex = "#\\w+(-\\w+)+";
        Pattern p = Pattern.compile(matchRegex);
        Matcher m = p.matcher(theWebPage);
        int i = 0;
        while (!m.hitEnd()) {
            try {
                System.out.println(m.group());
                courseNames.add(i, m.group());
                i++;
            } catch (IllegalStateException e) {
                // do nothing
            }
        }
    }
}
What I am trying to achieve with the above code is to get the list of departments on the MIT OpenCourseWare website. I am using a regular expression that matches the pattern of the department names as they appear in the page source, and I am using a Pattern object and a Matcher object to find() and print the department names that match the regular expression. But the code is taking forever to run, and I don't think reading in a web page with a BufferedReader should take that long. So either I am doing something horribly wrong, or parsing websites takes a ridiculously long time. I would appreciate any input on how to improve performance, or a correction if there is a mistake in my code. I apologize for the badly written code.
The problem is with this code:

    while ((str = reader.readLine()) != null)
        theWebPage = theWebPage + " " + str;

The variable theWebPage is a String, which is immutable. For each line read, this code creates a new String containing a copy of everything that has been read so far, with a space and the just-read line appended. This is an extraordinary amount of unnecessary copying, which is why the program is running so slowly.
I downloaded the web page in question. It has 55,000 lines and is about 3.25 MB in size. Not too big. But because of the copying in the loop, the work grows with the square of the number of lines: roughly 1.5 billion line-copies in total (half of 55,000 squared). The program is spending all its time copying and garbage collecting. I ran this on my laptop (2.66 GHz Core2Duo, 1 GB heap) and it took 15 minutes to run when reading from a local file (no network latency or web-crawling countermeasures).
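None of the following benchmark code is from the original answer; it is a minimal sketch to make the quadratic growth visible in isolation, with arbitrary line content and loop count:

    // Hypothetical micro-benchmark: repeated String concatenation (quadratic)
    // versus StringBuilder appends (linear). Sizes and content are arbitrary.
    public class ConcatDemo {
        public static void main(String[] args) {
            String line = "a fairly typical line of HTML source text";
            int n = 20_000;  // the String-concat loop already takes seconds at this size

            long t0 = System.nanoTime();
            String page = "";
            for (int i = 0; i < n; i++) {
                page = page + " " + line;    // copies everything built so far, every pass
            }
            long t1 = System.nanoTime();

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append(" ").append(line); // amortized constant work per append
            }
            String page2 = sb.toString();
            long t2 = System.nanoTime();

            System.out.println("String concat: " + (t1 - t0) / 1_000_000 + " ms");
            System.out.println("StringBuilder: " + (t2 - t1) / 1_000_000 + " ms");
            System.out.println("same result:   " + page.equals(page2)); // sanity check
        }
    }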
To fix this, make theWebPage a StringBuilder instead, and change the line in the loop to

    theWebPage.append(" ").append(str);

You can convert theWebPage to a String using toString() after the loop if you wish. When I ran the modified version, it took a fraction of a second.
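Putting that together with the read loop from the question, the fix looks something like this (a sketch that keeps the question's URL and reader setup):

    StringBuilder theWebPage = new StringBuilder();
    try {
        URL theUrl = new URL("http://ocw.mit.edu/courses/");
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(theUrl.openStream()));
        String str;
        while ((str = reader.readLine()) != null) {
            theWebPage.append(" ").append(str);  // no per-line copy of the whole page
        }
        reader.close();
    } catch (IOException e) {
        e.printStackTrace();  // at least report the failure instead of swallowing it
    }
    String page = theWebPage.toString();  // convert once, after the loop

MalformedURLException is a subclass of IOException, so a single catch clause covers both here.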
BTW, your code is using a bare code block within { } inside a class. This is an instance initializer (as opposed to a static initializer). It gets run at object construction time. This is legal, but it's quite unusual. Notice that it misled other commenters. I'd suggest converting this code block into a named method.
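For example (a structural sketch only; the method name loadPage and the placeholder body are mine, not from the question):

    public class Parser {
        private String theWebPage = "";

        public Parser() {
            loadPage();  // an explicit call makes the construction-time work obvious
        }

        // Formerly the bare { ... } instance-initializer block.
        private void loadPage() {
            // ... the URL/BufferedReader download code from the question goes here ...
            theWebPage = "downloaded page text";  // placeholder
        }
    }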
Is this your whole program? Where is the declaration of parserObject?

Also, shouldn't all of this code be in your main() prior to calling matchString()?
    parserObject courseObject = new parserObject();
    ArrayList<parserObject> courseObjects = new ArrayList<parserObject>();
    ArrayList<String> courseNames = new ArrayList<String>();
    String theWebPage = " ";

    {
        try {
            URL theUrl = new URL("http://ocw.mit.edu/courses/");
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(theUrl.openStream()));
            String str = null;
            while ((str = reader.readLine()) != null) {
                theWebPage = theWebPage + " " + str;
            }
            reader.close();
        } catch (MalformedURLException e) {
        } catch (IOException e) {
        }
    }
You are also catching exceptions without displaying any error messages. You should always display an error message and do something when you encounter an exception. For example, if you can't download the page, there is no reason to try to parse an empty string.
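For example (a sketch; it assumes the download code has been moved into a method, here hypothetically named fetchPage, so it can return early on failure):

    // Hypothetical helper: returns the page text, or null after reporting the error.
    private String fetchPage() {
        StringBuilder sb = new StringBuilder();
        try {
            URL theUrl = new URL("http://ocw.mit.edu/courses/");
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(theUrl.openStream()));
            String str;
            while ((str = reader.readLine()) != null) {
                sb.append(" ").append(str);
            }
            reader.close();
        } catch (MalformedURLException e) {
            System.err.println("Bad URL: " + e.getMessage());
            return null;  // no page, so the caller should not try to parse
        } catch (IOException e) {
            System.err.println("Could not download the page: " + e.getMessage());
            return null;
        }
        return sb.toString();
    }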
From your comment I learned about static blocks in classes (thank you, I didn't know about them). However, from what I've read, you need to put the keyword static before the opening { of the block. Also, it might just be better to put the code into your main; that way you can exit if you get a MalformedURLException or IOException.
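For reference, the static form this comment describes looks like this (a sketch; whether a static initializer is actually wanted here is a separate question):

    public class Parser {
        static String theWebPage;

        // A *static* initializer: note the keyword before the opening brace.
        static {
            theWebPage = "";  // runs once, when the class is loaded
        }
    }

Moving the download into main(), as suggested above, is likely the simpler option, since main() can simply print an error and call System.exit() when the download fails.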