 

HtmlUnit close all windows memory leak

HtmlUnit does not appear to close windows in the WebClient, which creates a memory leak. I am trying to fetch a page with HtmlUnit and pass it to JSoup for parsing. I am aware that JSoup can connect to a page directly, but I need this approach because I have to hold a logged-in session on some sites before parsing them.

Here is the code:

import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitLeakTest {

public static void main(String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException{

        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setCssEnabled(false);

        for(int i = 0; i < 500; i++){
            HtmlPage page = webClient.getPage("http://www.stackoverflow.com");
            Document doc = Jsoup.parse(page.asXml());
            webClient.closeAllWindows();
            System.out.println(i);
            if((i % 5 == 0)){
                System.out.println(i);
            }
        }
    }
}

As this runs, memory usage climbs continually, and in my debugger I can see that all the windows are still referenced by the WebClient and have not been closed.
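To put numbers on "memory continually climbs", heap usage can be logged from inside the loop. The following is only a minimal sketch using the standard `Runtime` API; the class name `HeapLogger` and the helper `usedHeapMb` are made up for illustration, and the HtmlUnit fetch is left as a placeholder comment.

```java
public class HeapLogger {

    /** Returns the heap currently in use, in megabytes (total minus free). */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 500; i++) {
            // ... fetch and parse the page here (placeholder for the HtmlUnit calls) ...
            if (i % 50 == 0) {
                // A value that keeps growing across iterations suggests objects are being retained.
                System.out.println("iteration " + i + ": " + usedHeapMb() + " MB used");
            }
        }
    }
}
```

The reading includes garbage that has not been collected yet, so a single high value means little; a steady upward trend over many iterations is the signal to look for.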

I have seen this code around that is supposed to close these windows:

// Stop the background JavaScript jobs of every open window, then close the windows.
List<WebWindow> windows = webClient.getWebWindows();
for (WebWindow ww : windows) {
    ww.getJobManager().removeAllJobs();
    ww.getJobManager().shutdown();
}
webClient.closeAllWindows();

But alas it does not and I continue to have the memory leak.

Anyone experienced this issue?

Cheers

Version info:

HtmlUnit 2.15

java version "1.7.0_51"

Java(TM) SE Runtime Environment (build 1.7.0_51-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
user1625233 asked Oct 19 '14 13:10



1 Answer

I have code very similar to yours, and I spent the last two days pulling my hair out trying to solve this. I tried everything I could find on the web without success, until I started experimenting with the code and the leak suddenly stopped. Before the change, a memory analyzer showed the program climbing to 2 GB of RAM (the Java heap size I set in the JVM arguments), and it crashed after about 20 minutes. Now it has been running for an hour and memory usage is stable at 10 MB.

What did I do? I moved the WebClient initialization inside the for loop, so each iteration creates its own client and closes it when done:

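import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitLeakTest {

    public static void main(String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException {

        for (int i = 0; i < 500; i++) {
            // Declared outside the try block so that it is still in scope in finally.
            WebClient webClient = initializeClient();
            try {
                HtmlPage page = webClient.getPage("http://www.stackoverflow.com");
                Document doc = Jsoup.parse(page.asXml());
                webClient.closeAllWindows();
                System.out.println(i);
                if ((i % 5 == 0)) {
                    System.out.println(i);
                }
            } finally {
                // Cancel pending JavaScript jobs, then release the client itself.
                webClient.getCurrentWindow().getJobManager().removeAllJobs();
                // If your HtmlUnit version has no close(), drop this line;
                // closeAllWindows() above has already closed the windows.
                webClient.close();
                System.gc();
            }
        }
    }

    // Builds a client with the same options as in the question.
    private static WebClient initializeClient() {
        final WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setCssEnabled(false);

        return webClient;
    }
}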

I know this is theoretically the wrong approach, but I was desperate to get it working, and now it does! If you have already fixed the problem with a better (correct) approach, I would like to hear about it too!
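For what it's worth, the "one client per iteration, always closed" idea can also be written with try-with-resources, provided the client type implements `AutoCloseable`. As far as I know newer HtmlUnit releases do this for `WebClient`, but treat that as an assumption and check your version. The sketch below uses a made-up stand-in class, `TrackedClient`, instead of the real `WebClient`, so that it runs without HtmlUnit or a network connection; `fetch` and `run` are hypothetical names too.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PerIterationClient {

    /** Hypothetical stand-in for WebClient; counts instances that are still open. */
    static final class TrackedClient implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();

        TrackedClient() {
            OPEN.incrementAndGet();
        }

        /** Pretends to download a page; the real code would call getPage(url). */
        String fetch(String url) {
            return "<html>" + url + "</html>";
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Creates and closes one client per iteration; returns how many remain open. */
    static int run(int iterations) {
        for (int i = 0; i < iterations; i++) {
            try (TrackedClient client = new TrackedClient()) {
                client.fetch("http://www.stackoverflow.com");
            } // close() runs here, even if fetch() throws
        }
        return TrackedClient.OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("clients still open: " + run(500));
    }
}
```

The point of the pattern is that the cleanup cannot be skipped or end up out of scope, which is the mistake that is easy to make with a hand-written finally block.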

pulu answered Nov 05 '22 00:11
