
Extract CSS styles from HTML using jsoup in Java

Tags: java, jsoup

Can anyone help me extract CSS styles from HTML using jsoup in Java? For example, in the HTML below I want to extract .ft00 and .ft01:

<HTML>
<HEAD>
<TITLE>Page 1</TITLE>

<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<DIV style="position:relative;width:931;height:1243;">
<STYLE type="text/css">
<!--
    .ft00{font-size:11px;font-family:Times;color:#ffffff;}
    .ft01{font-size:11px;font-family:Times;color:#ffffff;}
-->
</STYLE>
</HEAD>
</HTML>
Yashpal Singla asked Oct 31 '12 13:10



2 Answers

If the style is set inline on an element (a style attribute), you can read it with element.attr("style").
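The value returned by .attr("style") is one raw string such as "position:relative;width:931;height:1243;". If you need the individual properties, it can be split by hand. The following is only a minimal sketch, assuming simple declarations with no ';' inside a value (for example inside url(...)). The class name InlineStyle and its parse method are hypothetical helpers, not part of jsoup, and use only the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a jsoup class): splits the string returned by
// element.attr("style") into property/value pairs, keeping source order.
class InlineStyle {
    static Map<String, String> parse(String styleAttr) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String decl : styleAttr.split(";")) {
            int colon = decl.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed pieces
            }
            String name = decl.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = decl.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                props.put(name, value);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // The style value of the DIV from the question
        System.out.println(parse("position:relative;width:931;height:1243;"));
        // prints {position=relative, width=931, height=1243}
    }
}
```

In real code the argument would come from something like doc.select("div").first().attr("style").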

jsoup is an HTML parser, not a renderer, so it does not interpret CSS for you. For rules inside a <style> tag you have to take the tag's content and parse it yourself. A simple regex is enough for input like yours, but it won't work in all cases (class names containing hyphens, other selector types, @media blocks). For anything beyond that, use a CSS parser.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {
    public static void main(String[] args) throws Exception {
        String html = "<HTML>\n" +
                "<HEAD>\n"+
                "<TITLE>Page 1</TITLE>\n"+
                "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n"+
                "<DIV style=\"position:relative;width:931;height:1243;\">\n"+
                "<STYLE type=\"text/css\">\n"+
                "<!--\n"+
                "    .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"+
                "    .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"+
                "-->\n"+
                "</STYLE>\n"+
                "</HEAD>\n"+
                "</HTML>";

        Document doc = Jsoup.parse(html);
        // Take the first <style> element; data() returns its raw text content
        Element style = doc.select("style").first();
        // Matches ".name { declarations }": group 1 is the class name, group 2 the declarations
        Matcher cssMatcher = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]").matcher(style.data());
        while (cssMatcher.find()) {
            System.out.println("Style `" + cssMatcher.group(1) + "`: " + cssMatcher.group(2));
        }
    }
}

Will output:

Style `ft00`: font-size:11px;font-family:Times;color:#ffffff;
Style `ft01`: font-size:11px;font-family:Times;color:#ffffff;
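If you want to look up individual properties per class rather than print the raw declaration text, the matches can be collected into a nested map. This is only a sketch under the same assumptions as the regex above (plain class selectors, word characters only in the name, no nested braces). The class name ClassRules is hypothetical, and the CSS string is passed in directly so that the example needs nothing beyond the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps each class name to its property/value pairs.
class ClassRules {
    // Same pattern as in the answer: ".name { declarations }"
    private static final Pattern RULE = Pattern.compile("[.](\\w+)\\s*[{]([^}]+)[}]");

    static Map<String, Map<String, String>> parse(String css) {
        Map<String, Map<String, String>> rules = new LinkedHashMap<>();
        Matcher m = RULE.matcher(css);
        while (m.find()) {
            Map<String, String> props = new LinkedHashMap<>();
            for (String decl : m.group(2).split(";")) {
                String[] kv = decl.split(":", 2); // split on the first ':' only
                if (kv.length == 2) {
                    props.put(kv[0].trim(), kv[1].trim());
                }
            }
            rules.put(m.group(1), props);
        }
        return rules;
    }

    public static void main(String[] args) {
        // Content of the <style> tag from the question, including the comment wrapper
        String css = "<!--\n"
                + "    .ft00{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + "    .ft01{font-size:11px;font-family:Times;color:#ffffff;}\n"
                + "-->";
        Map<String, Map<String, String>> rules = parse(css);
        System.out.println(rules.keySet());                 // prints [ft00, ft01]
        System.out.println(rules.get("ft00").get("color")); // prints #ffffff
    }
}
```

With jsoup, the input string would be the text of the style element taken from the parsed document.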
Alex answered Sep 27 '22 23:09


Try this:

Document document = Jsoup.parse(html);
String style = document.select("style").first().data();

You can then pass that string to a CSS parser to fetch the details you are interested in, for example one of these:

  • http://www.w3.org/Style/CSS/SAC
  • http://cssparser.sourceforge.net
  • https://github.com/corgrath/osbcp-css-parser#readme
Emmanuel Bourg answered Sep 27 '22 23:09