
Jsoup. Print all text nodes in order

I want to parse this with Jsoup (this is a simplification; I would be parsing entire web pages):

<html><body><p>A<strong>B</strong>C<strong>D</strong>E</p></body></html>

to obtain all text nodes in the order they appear, that is:

A B C D E

I have tried two approaches:

Elements elements = doc.children().select("*");
for (Element el : elements)
    System.out.println(el.ownText());

which returns:

A C E B D

That is, the text inside the "strong" tags goes at the end.

I have also tried a recursive version:

myfunction(doc.children());

private void myfunction(Elements elements) {
  for (Element el : elements) {
    List<Node> nodos = el.childNodes();
    for (Node nodo : nodos) {
      if (nodo instanceof TextNode && !((TextNode) nodo).isBlank()) {
        System.out.println(((TextNode) nodo).text());
      }
    }
    myfunction(el.children());
  }
}

But the result is the same as before.

How can this be accomplished? I feel I am making something simple difficult...

Marcos Fernandez asked Oct 26 '25 03:10


2 Answers

How about:

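private static void myfunction(Node element) {
    for (Node n : element.childNodes()) {
        if (n instanceof TextNode && !((TextNode) n).isBlank()) {
            // Non-blank text node: print it at the position where it occurs
            System.out.println(((TextNode) n).text());
        } else {
            // Any other node: descend into it before moving on to the next sibling
            myfunction(n);
        }
    }
}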

Demo:

String html = "<html><body><p>A<strong>B</strong>C<strong>D</strong>E</p></body></html>";
Document doc = Jsoup.parse(html);
myfunction(doc.body());

Output:

A
B
C
D
E
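
To see why recursing at each child's position matters, here is a rough sketch that uses a tiny hand-rolled tree instead of jsoup. SketchNode, ownTextFirst and documentOrder are made-up names for illustration only, not part of jsoup. ownTextFirst is meant to mirror the recursion from the question (all text children of a node first, child elements afterwards), and documentOrder mirrors the function above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parsed DOM; this is NOT jsoup's Node/TextNode API.
class SketchNode {
    final String text;               // non-null only for text nodes
    final List<SketchNode> children; // empty for text nodes

    private SketchNode(String text, List<SketchNode> children) {
        this.text = text;
        this.children = children;
    }

    static SketchNode text(String text) {
        return new SketchNode(text, new ArrayList<SketchNode>());
    }

    static SketchNode element(SketchNode... children) {
        return new SketchNode(null, Arrays.asList(children));
    }
}

class TraversalOrderSketch {

    // Mirrors the question's recursion: every text child of a node is emitted
    // first, and only afterwards are the child elements visited.
    static void ownTextFirst(SketchNode node, List<String> out) {
        for (SketchNode child : node.children) {
            if (child.text != null) out.add(child.text);
        }
        for (SketchNode child : node.children) {
            if (child.text == null) ownTextFirst(child, out);
        }
    }

    // Mirrors the answer's recursion: each child is handled at its own position,
    // so nested text is emitted exactly where it occurs.
    static void documentOrder(SketchNode node, List<String> out) {
        for (SketchNode child : node.children) {
            if (child.text != null) {
                out.add(child.text);
            } else {
                documentOrder(child, out);
            }
        }
    }

    // Stand-in for <p>A<strong>B</strong>C<strong>D</strong>E</p>.
    static SketchNode sampleParagraph() {
        return SketchNode.element(
            SketchNode.text("A"),
            SketchNode.element(SketchNode.text("B")),
            SketchNode.text("C"),
            SketchNode.element(SketchNode.text("D")),
            SketchNode.text("E"));
    }

    public static void main(String[] args) {
        List<String> first = new ArrayList<String>();
        ownTextFirst(sampleParagraph(), first);
        System.out.println(first);  // [A, C, E, B, D]

        List<String> second = new ArrayList<String>();
        documentOrder(sampleParagraph(), second);
        System.out.println(second); // [A, B, C, D, E]
    }
}
```

The only difference between the two methods is where the recursive call sits: inside the single loop over the children, or in a second pass after all the text has been emitted.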

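Update for Java 15 and later: pattern matching for instanceof avoids the (TextNode) n cast. It was still a preview feature in Java 15 (see JEP 375: Pattern Matching for instanceof (Second Preview)) and became a standard feature in Java 16 (JEP 394).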

private static void myfunction(Node element) {
    for (Node n : element.childNodes()) {
        if (n instanceof TextNode tNode && !tNode.isBlank()) {
            System.out.println(tNode.text());
        } else {
            myfunction(n);
        }
    }
}
Pshemo answered Oct 28 '25 16:10


The text() method will do the trick, for example:

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><body><p>A<strong>B</strong>C<strong>D</strong>E</p></body></html>");
        // text() returns the combined text of body and all of its descendants as one string
        String texts = doc.body().text();
        System.out.println(texts);
    }
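
One thing to keep in mind: text() returns a single combined string, so for the sample input it should print ABCDE on one line rather than five separate values. If each text node is needed on its own, a possible sketch is to walk the tree with jsoup's NodeVisitor, which visits nodes in document order. The class name TextNodesInOrder and the helper textNodes are made up for this example, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; only the jsoup calls are real API.
class TextNodesInOrder {

    // Collects every non-blank text node under <body>, in document order.
    static List<String> textNodes(String html) {
        Document doc = Jsoup.parse(html);
        final List<String> texts = new ArrayList<String>();
        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // head() is called when a node is first reached, before its children
                if (node instanceof TextNode && !((TextNode) node).isBlank()) {
                    texts.add(((TextNode) node).text());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        });
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>A<strong>B</strong>C<strong>D</strong>E</p></body></html>";
        // One entry per text node
        for (String text : textNodes(html)) {
            System.out.println(text);
        }
        // One combined string for comparison
        System.out.println(Jsoup.parse(html).body().text());
    }
}
```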
johnII answered Oct 28 '25 15:10


