Jsoup - Keep only the tags and remove all the text

I am trying to remove all the text between the tags of an HTML page using Jsoup.

For example, if the input HTML is:

    <!DOCTYPE html>
    <html>
    <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    </body>
    </html>

The output should be:

    <!DOCTYPE html>
    <html>
    <body>
    <h1></h1>
    <p></p>
    </body>
    </html>

Basically, I want to remove what is returned by doc.text().

I have found a lot of posts that do the opposite and keep only the text, but nothing that solves my problem. Any idea how to do this?

EDIT

The solution proposed by maverick9999 (https://stackoverflow.com/a/24292349/3589481) solves most cases.

However, as noted in the comments, this solution also removes nested tags.

As an example:

    String str = "<!DOCTYPE html>" +
            "<html>" +
            "<body>" +
            "<div class='foo'>text <div class='THIS DIV WILL BE REMOVED'>text</div> text </div>" +
            "<h1>My First Heading</h1>\n" +
            "<p>My first paragraph.</p>\n" +
            "</body>\n" +
            "</html>";

    Document doc = Jsoup.parse(str);

    // Blank every element that has text of its own.
    // Element.text("") clears all existing contents of the element (text and
    // child elements alike), so the div nested inside div.foo is dropped as well.
    for (Element e : doc.select("*")) {
        if (!e.ownText().isEmpty()) {
            e.text("");
        }
    }

    System.out.println(doc);

This removes the nested div from the output:

    <html>
     <head></head>
     <body>
      <div class="foo">
      </div>
     </body>
    </html>

Any thoughts to avoid this?

EDIT 2

For some reason, Jsoup treats the "meta" tag as self-closing. So if you have something like this:

    System.out.println("\n\n----");
    String html = "<!DOCTYPE html>\r\n"
            + "<html>\r\n"
            + "<head>\n"
            + "<meta content=\"/myimage.png\" itemprop=\"image\">\n"
            + "<title>Title</title>\n"
            + "<script>Random Javascript here</script>"
            + "</meta>"
            + "</head>"
            + "<body>\r\n"
            + "<h1>My First <i>Heading</i></h1>\r\n"
            + "<hr/>\r\n"
            + "<p>My first paragraph.</p>\r\n"
            + "<p> <div class='foo'>text <div class='bar'> text </div> text </div> </p>\r\n"
            + "</body>\r\n"
            + "</html>";

    Document doc2 = Jsoup.parse(html, "", Parser.xmlParser());
    printNodes(doc2);

Then none of the tags after meta are read. With Pshemo's solution, the scripts are removed, and tags such as br that have children are removed as well. I finally ended up with the following solution (thanks to Pshemo for his help):

    public static void printNodes(Node node) {
        String name = node.nodeName();
        if (name.equals("#doctype")) {
            System.out.println(node);
        } else if (name.equals("#text")) {
            // Text nodes are exactly what we want to drop.
            return;
        } else if (name.equals("#document")) {
            for (Node n : node.childNodes()) {
                printNodes(n);
            }
        } else if (name.equals("head") || name.equals("script")) {
            // There is no reason to have text here, so print everything.
            System.out.println(node);
        } else if (!Tag.valueOf(name).isSelfClosing() || node.childNodeSize() > 0) {
            // Regular element, or a "self-closing" tag that still has children.
            System.out.println("<" + name + getAttributes(node) + ">");
            for (Node n : node.childNodes()) {
                printNodes(n);
            }
            System.out.println("</" + name + ">");
        } else {
            // Self-closing tag without children.
            System.out.println("<" + name + getAttributes(node) + "/>");
        }
    }

    // Builds the attribute string of a node, e.g. ` class="foo" id="bar"`.
    public static String getAttributes(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Attribute attr : node.attributes()) {
            sb.append(" ").append(attr.getKey()).append("=\"")
                    .append(attr.getValue()).append("\"");
        }
        return sb.toString();
    }
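
For reuse, the same traversal can collect its output in a StringBuilder instead of printing it. The following is only a sketch of that idea, not a tested replacement: the class name TagSkeleton and the methods skeleton and render are made up for illustration, and it assumes that Attributes.html() is an acceptable substitute for getAttributes (it also escapes attribute values). Nodes such as comments are copied unchanged, which is an assumption about what is wanted.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public class TagSkeleton {

    /** Hypothetical helper: parses with the XML parser and returns the markup without text. */
    static String skeleton(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        StringBuilder out = new StringBuilder();
        render(doc, out);
        return out.toString();
    }

    /** Appends the markup of node to out, skipping every text node. */
    static void render(Node node, StringBuilder out) {
        if (node instanceof TextNode) {
            return; // drop the text itself
        }
        String name = node.nodeName();
        if (name.equals("#document")) {
            for (Node child : node.childNodes()) {
                render(child, out);
            }
            return;
        }
        // Doctype, comments and similar nodes, plus head and script, are copied as-is,
        // mirroring the printNodes method above.
        if (name.startsWith("#") || name.equals("head") || name.equals("script")) {
            out.append(node.outerHtml());
            return;
        }
        // Attributes.html() yields e.g. ` class="foo"` with a leading space per attribute.
        out.append('<').append(name).append(node.attributes().html());
        if (node.childNodeSize() == 0 && Tag.valueOf(name).isSelfClosing()) {
            out.append("/>");
            return;
        }
        out.append('>');
        for (Node child : node.childNodes()) {
            render(child, out);
        }
        out.append("</").append(name).append('>');
    }

    public static void main(String[] args) {
        System.out.println(skeleton(
                "<div class='foo'>text <div class='bar'> text </div> text </div><hr/>"));
    }
}
```

Returning a String rather than printing makes the result easier to compare against an expected skeleton.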

asked Jun 18 '14 16:06

Yannickv

1 Answer

The code below should solve your problem with nested tags:

Updated code:

    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element el : doc.select("*")) {
        if (!el.ownText().isEmpty()) {
            for (TextNode node : el.textNodes()) {
                node.remove();
            }
        }
    }

    System.out.println(doc);
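
To try this on the nested-div case from the question, here is a minimal self-contained sketch. It is a variation under stated assumptions rather than the exact code above: the class name StripTextDemo and the method stripText are made up for illustration, it uses Jsoup's default HTML parser instead of Parser.xmlParser(), and it drops the ownText() check so that whitespace-only text nodes are removed too.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class StripTextDemo {

    /** Hypothetical helper: removes every text node but keeps all elements. */
    static Document stripText(String html) {
        Document doc = Jsoup.parse(html);
        for (Element el : doc.select("*")) {
            // textNodes() returns a new list, so removing nodes while iterating is safe.
            for (TextNode node : el.textNodes()) {
                node.remove();
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div class='foo'>text <div class='bar'>text</div> text</div>"
                + "<h1>My First <i>Heading</i></h1>";
        Document doc = stripText(html);
        // Turn off pretty-printing so the output is not re-indented.
        doc.outputSettings().prettyPrint(false);
        System.out.println(doc.body().html());
    }
}
```

Because only TextNode children are removed, the inner div and the i element should survive with their attributes intact, which is the difference from calling e.text("") on the parent.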

answered Oct 20 '22 09:10

theinvisible

Tags: java, html, jsoup
