
How can I tell HtmlUnit's WebClient to download images and css?

Tags:

java

htmlunit

How can I make WebClient download external css stylesheets and image bodies just like a usual web browser does?

Fluffy asked Feb 11 '10 12:02


3 Answers

What I'm doing right now is:

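import java.net.URL;
import java.util.HashMap;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebRequestSettings;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Accept headers keyed by the tag name of the element that triggers the request.
public static final HashMap<String, String> acceptTypes = new HashMap<String, String>(){{
        put("html", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        put("img", "image/png,image/*;q=0.8,*/*;q=0.5");
        put("script", "*/*");
        put("style", "text/css,*/*;q=0.1");
        // Stylesheets are selected below as <link> elements, so they need their own key.
        put("link", "text/css,*/*;q=0.1");
    }};

// "client" is the WebClient field of the enclosing class.
protected void downloadCssAndImages(HtmlPage page) {
        // Select every <img> plus every <link type="text/css">.
        String xPathExpression = "//*[name() = 'img' or (name() = 'link' and @type = 'text/css')]";
        List<?> resultList = page.getByXPath(xPathExpression);

        for (Object node : resultList) {
            try {
                HtmlElement el = (HtmlElement) node;

                // <img> carries its URL in src, <link> in href.
                String path = el.getAttribute("src").equals("") ? el.getAttribute("href") : el.getAttribute("src");
                if (path == null || path.equals("")) continue;

                URL url = page.getFullyQualifiedUrl(path);

                WebRequestSettings wrs = new WebRequestSettings(url);
                wrs.setAdditionalHeader("Referer", page.getWebResponse().getRequestSettings().getUrl().toString());

                client.addRequestHeader("Accept", acceptTypes.get(el.getTagName().toLowerCase()));
                client.getPage(wrs);
            } catch (Exception e) {
                // Skip resources that fail to download, but do not hide the failure.
                e.printStackTrace();
            }
        }

        client.removeRequestHeader("Accept");
}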
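A lookup such as acceptTypes.get(tagName) returns null for any tag name that has no entry in the map, and that null would then be passed on as the Accept header value. A minimal sketch of a lookup with a fallback, using only the standard library; the class name AcceptHeaders, the method forTag and the "*/*" default are assumptions for illustration, not part of HtmlUnit:

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps an element's tag name to an Accept header value.
final class AcceptHeaders {
    // Assumed fallback for tags without a dedicated entry.
    private static final String DEFAULT_ACCEPT = "*/*";

    private static final Map<String, String> BY_TAG = Map.of(
            "img", "image/png,image/*;q=0.8,*/*;q=0.5",
            "link", "text/css,*/*;q=0.1",
            "script", "*/*");

    // Never returns null: unknown tags fall back to DEFAULT_ACCEPT.
    static String forTag(String tagName) {
        return BY_TAG.getOrDefault(tagName.toLowerCase(Locale.ROOT), DEFAULT_ACCEPT);
    }

    public static void main(String[] args) {
        System.out.println(forTag("LINK"));
        System.out.println(forTag("video"));
    }
}
```

In the method above, the header could then be set with AcceptHeaders.forTag(el.getTagName()) in place of the direct map lookup.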
Fluffy answered Nov 10 '22 20:11


Source: How to get base64 encoded contents for an ImageReader?

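// p is the HtmlPage; this picks the fourth <img> on the page.
HtmlImage img = (HtmlImage) p.getByXPath("//img").get(3);
ImageReader imageReader = img.getImageReader();
BufferedImage bufferedImage = imageReader.read(0);
String formatName = imageReader.getFormatName();
ByteArrayOutputStream byteaOutput = new ByteArrayOutputStream();
// Base64OutputStream is presumably the one from Apache Commons Codec.
Base64OutputStream base64Output = new Base64OutputStream(byteaOutput);
ImageIO.write(bufferedImage, formatName, base64Output);
// Closing writes the final Base64 block and padding before the bytes are read.
base64Output.close();
String base64 = new String(byteaOutput.toByteArray());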
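On Java 8 or later, the same encoding can be done without a third-party stream class. This is a rough sketch that swaps in java.util.Base64 from the standard library; the class name ImageBase64 and its two methods are hypothetical, and the BufferedImage is assumed to come from something like imageReader.read(0):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper: converts between BufferedImage and Base64 text.
final class ImageBase64 {
    // Serializes the image in the given format (for example "png") and Base64-encodes the bytes.
    static String encode(BufferedImage image, String formatName) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            if (!ImageIO.write(image, formatName, bytes)) {
                throw new IOException("No ImageIO writer for format: " + formatName);
            }
            return Base64.getEncoder().encodeToString(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reverses encode(): Base64 text back to an image.
    static BufferedImage decode(String base64) {
        try {
            byte[] raw = Base64.getDecoder().decode(base64);
            return ImageIO.read(new ByteArrayInputStream(raw));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because encodeToString works on the complete byte array, there is no stream to flush or close before reading the result.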
jer answered Nov 10 '22 20:11


Here's what I came up with:

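public InputStream httpGetLowLevel(URL url) throws IOException
{
    WebRequest wrq = new WebRequest(url);

    ProxyConfig config = webClient.getProxyConfig();

    // Reuse the WebClient's proxy settings and proxy credentials for this request.
    wrq.setProxyHost(config.getProxyHost());
    wrq.setProxyPort(config.getProxyPort());
    wrq.setCredentials(webClient.getCredentialsProvider().getCredentials(new AuthScope(config.getProxyHost(), config.getProxyPort())));

    // setAdditionalHeader replaces any earlier value for the same name,
    // so all cookies are joined into a single Cookie header.
    StringBuilder cookieHeader = new StringBuilder();
    for (Cookie c : webClient.getCookieManager().getCookies(url)) {
        if (cookieHeader.length() > 0) cookieHeader.append("; ");
        cookieHeader.append(c.getName()).append('=').append(c.getValue());
    }
    if (cookieHeader.length() > 0) {
        wrq.setAdditionalHeader("Cookie", cookieHeader.toString());
    }

    // Fetch through the WebClient's own connection instead of loading a page.
    WebResponse wr = webClient.getWebConnection().getResponse(wrq);
    return wr.getContentAsStream();
}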

My tests show that it supports proxies. It also not only carries the cookies from the WebClient: if the server sends new cookies in the response, the WebClient stores those too.
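The cookie behaviour described here (cookies set by a response are stored and sent again on later requests) can be illustrated without HtmlUnit. The sketch below uses java.net.CookieManager from the standard library as a stand-in for the WebClient's cookie handling; the class name CookieRoundTrip, its methods and the example.com URLs are made up for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical demo: stores cookies from a response and replays them on later requests.
final class CookieRoundTrip {
    // Default CookieManager: in-memory store, accepts cookies from the original server only.
    private final CookieManager cookies = new CookieManager();

    // Records one Set-Cookie header value as if it arrived in a response from uri.
    void storeFromResponse(URI uri, String setCookieValue) {
        try {
            cookies.put(uri, Map.of("Set-Cookie", List.of(setCookieValue)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds the Cookie header value that a request to uri would carry ("" if none apply).
    String cookieHeaderFor(URI uri) {
        try {
            Map<String, List<String>> headers = cookies.get(uri, Collections.emptyMap());
            return String.join("; ", headers.getOrDefault("Cookie", List.of()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CookieRoundTrip jar = new CookieRoundTrip();
        jar.storeFromResponse(URI.create("http://example.com/"), "session=abc; Path=/");
        System.out.println(jar.cookieHeaderFor(URI.create("http://example.com/style.css")));
    }
}
```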

Arsen Zahray answered Nov 10 '22 21:11