
Jsoup - Keep only the tags and remove all the text

Tags: java, html, jsoup

I am trying to remove all the text between the tags of an HTML page using Jsoup.

For example, if the input HTML is

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

The output should be

<!DOCTYPE html>
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>

Basically, I want to remove what is returned by doc.text().

I have found a lot of posts that do the opposite and keep only the text, but nothing that solves my problem. Any idea on how to do this?

EDIT

The solution proposed by maverick9999 (https://stackoverflow.com/a/24292349/3589481) solves most cases.

However, as noted in the comments, this solution also removes nested tags.

As an example:

    String str = "<!DOCTYPE html>" +
            "<html>" +
            "<body>" +
            "<div class='foo'>text <div class='THIS DIV WILL BE REMOVED'>text</div> text </div>" +
            "<h1>My First Heading</h1>\n" +
            "<p>My first paragraph.</p>\n" +
            "</body>\n" +
            "</html>";

    Document doc = Jsoup.parse(str);

    // Blank out every element that has text of its own.
    for (Element e : doc.select("*")) {
        if (!e.ownText().isEmpty()) {
            // text("") clears ALL children of e, nested elements included
            e.text("");
        }
    }

    System.out.println(doc);

This removes the nested div from the output:

    <html>
     <head></head>
     <body>
      <div class="foo">
      </div>
     </body>
    </html>

Any thoughts to avoid this?

EDIT 2

For some reason, Jsoup treats the "meta" tag as self-closing. So if you have something like this:

    System.out.println("\n\n----");
    String html = "<!DOCTYPE html>\r\n"
            + "<html>\r\n"
            + "<head>\n"
            + "<meta content=\"/myimage.png\" itemprop=\"image\">\n"
            + "<title>Title</title>\n"
            + "<script>Random Javascript here</script>"
            + "</meta>"
            + "</head>"
            + "<body>\r\n"
            + "<h1>My First <i>Heading</i></h1>\r\n"
            + "<hr/>\r\n"
            + "<p>My first paragraph.</p>\r\n"
            + "<p> <div class='foo'>text <div class='bar'> text </div> text </div> </p>\r\n"
            + "</body>\r\n"
            + "</html>";

    Document doc2 = Jsoup.parse(html, "", Parser.xmlParser());
    printNodes(doc2);

Then none of the tags after meta are read. With Pshemo's solution, the scripts are removed, and if you have br tags with children (for example), those are removed as well. I finally ended up with the following solution (thanks to Pshemo for his help):

    import org.jsoup.nodes.Attribute;
    import org.jsoup.nodes.Node;
    import org.jsoup.parser.Tag;

    // Recursively prints the markup of a node tree, skipping every text node.
    public static void printNodes(Node node) {
        String name = node.nodeName();
        if (name.equals("#doctype")) {
            System.out.println(node);
        } else if (name.equals("#text")) {
            return; // text nodes are dropped
        } else if (name.equals("#document")) {
            for (Node n : node.childNodes()) {
                printNodes(n);
            }
        } else if (name.equals("head") || name.equals("script")) {
            // There is no reason to have text here, so print everything
            System.out.println(node.toString());
        } else {
            if (!Tag.valueOf(name).isSelfClosing() || node.childNodeSize() > 0) {
                // Regular element, or a "self-closing" one that still has children
                System.out.println("<" + name + getAttributes(node) + ">");
                for (Node n : node.childNodes()) {
                    printNodes(n);
                }
                System.out.println("</" + name + ">");
            } else {
                // System.out.println("debug: " + name + " is self closing");
                System.out.println("<" + name + getAttributes(node) + "/>");
            }
        }
    }

    // Serializes the attributes of a node as: key="value" key2="value2"
    public static String getAttributes(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Attribute attr : node.attributes()) {
            sb.append(" ").append(attr.getKey()).append("=\"")
                    .append(attr.getValue()).append("\"");
        }
        return sb.toString();
    }
asked Jun 18 '14 at 16:06 by Yannickv




1 Answer

The code below should solve your problem with nested tags:

Updated code:

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element el : doc.select("*")) {
        if (!el.ownText().isEmpty()) {
            for (TextNode node : el.textNodes()) {
                node.remove();
            }
        }
    }

    System.out.println(doc);
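
To see why this keeps nested tags while e.text("") does not, here is a minimal, self-contained sketch that runs both approaches on the nested-div input from the question. The class name StripTextDemo and the helper names clearWithText and stripText are made up for illustration, and it uses the default HTML parser rather than Parser.xmlParser(). It assumes jsoup is on the classpath. Element.text(String) clears all existing children of the element before setting the new text, so the inner div is lost; removing only the TextNode children leaves the element tree intact.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical demo class, not part of jsoup.
class StripTextDemo {

    // Approach from the question: text("") empties the element,
    // which also discards any nested elements.
    static Document clearWithText(String html) {
        Document doc = Jsoup.parse(html);
        for (Element e : doc.select("*")) {
            if (!e.ownText().isEmpty()) {
                e.text("");
            }
        }
        return doc;
    }

    // Approach from this answer: remove only the text nodes that are
    // direct children of each element, leaving child elements in place.
    static Document stripText(String html) {
        Document doc = Jsoup.parse(html);
        for (Element el : doc.getAllElements()) {
            // textNodes() returns a separate list, so removing nodes
            // from the document while iterating over it is safe.
            for (TextNode node : el.textNodes()) {
                node.remove();
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div class='foo'>text <div class='bar'>text</div> text</div>";

        // The nested div is gone after text("").
        System.out.println(clearWithText(html).select("div.bar").size()); // prints 0

        // The nested div survives, and no text is left in the document.
        Document stripped = stripText(html);
        System.out.println(stripped.select("div.bar").size()); // prints 1
        System.out.println(stripped.text().isEmpty());         // prints true
    }
}
```

The ownText() check from the snippet above is left out here, since textNodes() simply returns an empty list for elements without text of their own.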
answered Oct 20 '22 at 09:10 by theinvisible