I have found several topics with similar questions and valuable answers, but I am still struggling with this:
I want to parse some HTML with Jsoup so that I can replace, for example, "changeme" with <changed>changeme</changed>, but only if it appears in a text portion of the HTML, not if it is part of a tag. So, starting with this HTML:
<body>
<p><a href="http://changeme.html">test changeme app</a></p>
</BODY>
</HTML>
I would want to get to this:
<body>
<p><a href="http://changeme.html">test <changed>changeme</changed> app</a></p>
</BODY>
</HTML>
I have tried several approaches; this is the one that gets me closest to the desired result:
Document doc = null;
try {
    doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
}
Elements els = doc.body().getAllElements();
for (Element e : els) {
    if (e.text().contains("changeme")) {
        e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
    }
}
html = doc.toString();
System.out.println(html);
But with this approach I find two problems:
<body>
<p><a href="http://<changed>changeme</changed> .html">test
<changed>
changeme
</changed>
app</a></p>
</BODY>
</HTML>
Line breaks are inserted before and after the new element I am introducing. This is not a real problem, as I could get rid of them by using #changed# for the replacement and, after doc.toString(), replacing it again with the desired value (with < >).
The real problem: The URL in the href has been modified, and I don't want it to happen.
Ideas? Thx.
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation.
Both the Jsoup and Parser classes have no state and only hold static methods. The TreeBuilder class does have state and seems to do all the work, but it is created inside a method, so the whole operation is thread-safe by virtue of stack/thread confinement.
Here is my solution:
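import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

String html = ""
    + "<p><a href=\"http://changeme.html\">"
    + "test changeme "
    + "<div class=\"changeme\">"
    + "inner text changeme"
    + "</div>"
    + " app</a>"
    + "</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
    // textNodes() returns only the text node children, so attributes are never touched
    List<TextNode> tnList = e.textNodes();
    for (TextNode tn : tnList) {
        String orig = tn.text();
        // text(...) stores the string as plain text, so the tag brackets are escaped on output
        tn.text(orig.replaceAll("changeme", "<changed>changeme</changed>"));
    }
}
html = doc.toString();
System.out.println(html);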
TextNodes are always leaf nodes, i.e. they do not contain any further HTML elements. In your original approach you replace the HTML of an element with new HTML in which the changeme strings have been replaced. You only check whether changeme is part of the element's text, but you then replace every occurrence in the element's HTML string, including all occurrences outside TextNodes.
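To make the mismatch between the check and the replacement concrete, here is a minimal sketch that uses plain Java strings only, no Jsoup. The class name ReplaceScopeDemo and the helper wrapEverywhere are made up for illustration, and the two literals simply stand in for what e.html() and e.text() would roughly give for the <p> element in the question.

```java
public class ReplaceScopeDemo {
    // Hypothetical helper: wraps every occurrence of the word, wherever it sits in the string.
    static String wrapEverywhere(String s) {
        return s.replace("changeme", "<changed>changeme</changed>");
    }

    public static void main(String[] args) {
        // Stand-in for the inner HTML of the <p> element, written out as a literal.
        String innerHtml = "<a href=\"http://changeme.html\">test changeme app</a>";
        // Stand-in for the visible text of the same element.
        String visibleText = "test changeme app";

        // The check looks at the visible text only...
        System.out.println(visibleText.contains("changeme"));
        // ...but the replacement runs over the whole markup, so the href is rewritten too.
        System.out.println(wrapEverywhere(innerHtml));
    }
}
```

The string replacement has no notion of tags or attributes, which is why the href gets changed along with the text.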
My solution basically works like yours, but I use the Jsoup method textNodes(). This way I don't need to typecast.
P.S.
Of course, my solution as well as yours will contain &lt;changed&gt;changeme&lt;/changed&gt; instead of <changed>changeme</changed> in the end. This may or may not be what you want. If you do not want this, then your result is no longer valid HTML, since changed is not a valid HTML tag. Jsoup will not help you in this case. However, you can of course replace all &lt;changed&gt;changeme&lt;/changed&gt; in the resulting string again, outside Jsoup.
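If a real <changed> element is wanted in the tree rather than escaped text, one possible alternative (a different technique from the code above, not a correction of it) is to split each text node around the match and wrap the middle piece. The sketch below assumes a Jsoup version that provides TextNode.splitText(int), TextNode.getWholeText() and Node.wrap(String); the class name WrapInTextNodes and the method wrapWord are made up for illustration, and the behaviour for unusual parents such as textarea has not been checked.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class WrapInTextNodes {
    // Hypothetical helper: wraps each occurrence of word inside text nodes in a <changed> element.
    static String wrapWord(String html, String word) {
        Document doc = Jsoup.parse(html);
        // Turn off pretty-printing so no line breaks are added around the new element.
        doc.outputSettings().prettyPrint(false);
        // getAllElements() is collected up front, so elements created below are not revisited.
        for (Element e : doc.body().getAllElements()) {
            // textNodes() returns its own list, so changing the DOM inside the loop is safe.
            for (TextNode tn : e.textNodes()) {
                TextNode current = tn;
                int idx = current.getWholeText().indexOf(word);
                while (idx >= 0) {
                    // Cut off everything from the match onwards.
                    TextNode match = current.splitText(idx);
                    // Cut the remainder off after the match, if there is any remainder.
                    TextNode rest = match.getWholeText().length() > word.length()
                            ? match.splitText(word.length())
                            : null;
                    // The middle piece now holds exactly the word; wrap it in the new element.
                    match.wrap("<changed></changed>");
                    if (rest == null) {
                        break;
                    }
                    current = rest;
                    idx = current.getWholeText().indexOf(word);
                }
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://changeme.html\">test changeme app</a></p>";
        System.out.println(wrapWord(html, "changeme"));
    }
}
```

Because only text nodes are split and wrapped, attribute values such as the href are never part of the search, and no string post-processing is needed afterwards.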