 

Replace string with jsoup only in text portions

Tags: java, jsoup

I have found several topics with similar questions and valuable answers, but I am still struggling with this:

I want to parse some HTML with jsoup and replace, for example, "changeme" with <changed>changeme</changed>, but only where it appears in a text portion of the HTML, not where it is part of a tag. So, starting with this HTML:

<body>
<p><a href="http://changeme.html">test changeme app</a></p>
</BODY>
</HTML>

I would want to get to this:

<body>
<p><a href="http://changeme.html">test <changed>changeme</changed> app</a></p>
</BODY>
</HTML>

I have tried several approaches; this is the one that gets me closest to the desired result:

Document doc = null;
try {
    doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
    ex.printStackTrace(); // report the failure instead of silently swallowing it
}

Elements els = doc.body().getAllElements();
for (Element e : els) {
    if (e.text().contains("changeme")) {
        // rewrite the element's inner HTML with the replacement applied
        e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
    }
}
String html = doc.toString();
System.out.println(html);

But this approach produces the following output, which has two problems:

<body>
<p><a href="http://<changed>changeme</changed> .html">test
    <changed>
        changeme
    </changed> 
app</a></p>
</BODY>
</HTML>
  1. Line breaks are inserted before and after the new element. This is not a real problem: I could do the replacement with a placeholder such as #changed#, and after doc.toString() replace the placeholder with the desired value (with < >).

  2. The real problem: the URL in the href has been modified, and I don't want that to happen.

Ideas? Thx.

Marcos Fernandez asked Dec 17 '15 17:12




1 Answer

Here is my solution:

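import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

String html=""
    +"<p><a href=\"http://changeme.html\">"
    +   "test changeme "
    +   "<div class=\"changeme\">"
    +     "inner text changeme"
    +   "</div>"
    +   " app</a>"
    +"</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
    // only the element's own text nodes are visited, never its attributes
    List<TextNode> tnList = e.textNodes();
    for (TextNode tn : tnList){
        String orig = tn.text();
        // text(...) stores plain text, so the markup is escaped when serialized
        tn.text(orig.replace("changeme","<changed>changeme</changed>"));
    }
}

html = doc.toString();
System.out.println(html);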

TextNodes are always leaf nodes, i.e. they do not contain any further HTML elements. In your original approach you replace the element's HTML with new HTML in which every changeme string has been replaced. You only check that changeme is part of the element's text, but you then replace every occurrence in the element's HTML string, including all occurrences outside TextNodes, such as attribute values.

My solution basically works like yours, but I use the jsoup method textNodes(), which returns an element's direct child text nodes as a List<TextNode>. This way I don't need to typecast.
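The textNodes() loop can also be taken one step further. The following is a minimal sketch, not part of the answer above, that splits each text node around the match with splitText() and wraps the matching piece in a real element with wrap(), so attribute values are never touched. The class WrapInTextNodes and the helpers wrapInText and wrapMatches are hypothetical names, and pretty-printing is switched off on the assumption that the extra line breaks are unwanted.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class WrapInTextNodes {

    // Hypothetical helper: wraps every occurrence of target that appears in a
    // text node under <body> in a <changed> element and returns the body HTML.
    static String wrapInText(String html, String target) {
        if (target.isEmpty()) {
            throw new IllegalArgumentException("target must not be empty");
        }
        Document doc = Jsoup.parse(html);
        // Assumption: no extra line breaks are wanted in the output.
        doc.outputSettings().prettyPrint(false);

        // Both lists are collected before the loop bodies run, so changing
        // the DOM while iterating does not disturb the iteration.
        for (Element e : doc.body().getAllElements()) {
            for (TextNode tn : e.textNodes()) {
                wrapMatches(tn, target);
            }
        }
        return doc.body().html();
    }

    // Splits node around each occurrence of target and wraps the match.
    static void wrapMatches(TextNode node, String target) {
        TextNode current = node;
        while (current != null) {
            int idx = current.getWholeText().indexOf(target);
            if (idx < 0) {
                return; // no further occurrence in this text node
            }
            // Cut off the text before the match, unless the match starts at 0.
            TextNode match = (idx == 0) ? current : current.splitText(idx);
            // Cut off the text after the match, if there is any.
            TextNode rest = (match.getWholeText().length() > target.length())
                    ? match.splitText(target.length())
                    : null;
            // wrap() parses its argument as markup and puts the node inside it.
            match.wrap("<changed></changed>");
            current = rest;
        }
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://changeme.html\">test changeme app</a></p>";
        System.out.println(wrapInText(html, "changeme"));
    }
}
```

Because only text nodes are ever split, the href stays as it was. Since wrap() parses its argument as markup, the unknown tag changed ends up in the document as an element rather than as escaped text.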

P.S. Of course, because tn.text(...) stores plain text, the output of my solution will contain &lt;changed&gt;changeme&lt;/changed&gt; instead of <changed>changeme</changed>. This may or may not be what you want. If you do want the literal tags, keep in mind that the result is no longer valid HTML, since changed is not a valid HTML tag. The text-based replacement cannot produce them for you, but you can of course replace every &lt;changed&gt;changeme&lt;/changed&gt; in the resulting string again, outside jsoup.
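The string-level clean-up mentioned in the P.S. could look like the sketch below. It uses only String.replace from the standard library; UnescapeChanged and unescapeMarker are hypothetical names, and the sample input is an assumed fragment of serialized output.

```java
class UnescapeChanged {

    // Hypothetical helper: turns the escaped marker back into literal tags.
    // It works on the serialized string, after jsoup has done its job.
    static String unescapeMarker(String html) {
        return html
                .replace("&lt;changed&gt;", "<changed>")
                .replace("&lt;/changed&gt;", "</changed>");
    }

    public static void main(String[] args) {
        // Assumed sample of what doc.toString() might contain.
        String serialized = "<a href=\"http://changeme.html\">"
                + "test &lt;changed&gt;changeme&lt;/changed&gt; app</a>";
        System.out.println(unescapeMarker(serialized));
    }
}
```

One caveat with this approach: any escaped &lt;changed&gt; that was already present in the original text would be converted as well, so it is only safe when the marker cannot occur in the input.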

luksch answered Oct 27 '22 19:10