Parse HTML with jsoup and preserve original content

I want to replace some elements in HTML files, keeping all the other content unchanged.

Document doc = Jsoup.parse("<div id=title>Old</div >\n" +
        "<p>1<p>2\n" +
        "<table><tr><td>1</td></tr></table>");
doc.getElementById("title").text("New");
System.out.println(doc.toString());

I expect to have the following output:

<div id=title>New</div >
<p>1<p>2
<table><tr><td>1</td></tr></table>

Instead, I have:

<html>
 <head></head>
 <body>
  <div id="title">New</div>
  <p>1</p>
  <p>2 </p>
  <table>
   <tbody>
    <tr>
     <td>1</td>
    </tr>
   </tbody>
  </table>
 </body>
</html>

Jsoup added:

  1. closing p tags
  2. double quotes around attribute values
  3. tbody
  4. html, head and body elements

Can I serialise the modified HTML so that it matches the original everywhere except for my changes? Jericho does that, but it doesn't provide slick DOM manipulation methods the way jsoup does.

asked Aug 22 '12 09:08 by NVI

1 Answer

Is there a reason why attribute values shouldn't get quoted? See here and here.

For the other points try this:

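final String html = "<div id=title>Old</div >\n"
        + "<p>1<p>2\n"
        + "<table><tr><td>1</td></tr></table>";

Document doc = Jsoup.parse(html);
// Replace the text of the element whose id is "title"
doc.select("[id=title]").first().text("New");
// Remove the wrapper elements the parser added, keeping their children
doc.select("body, head, html, tbody").unwrap();
// Turn off pretty-printing so jsoup adds no indentation or line breaks
doc.outputSettings().prettyPrint(false);

System.out.println(doc);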
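
If the output has to match the source character for character (unquoted attributes, unclosed p tags, the space in `</div >`), a parser round trip is unlikely to get there, because jsoup writes the document back out from its DOM. Below is a minimal sketch of the other route: find the offsets of the text in the original string and splice the new text in. It uses only java.util.regex. The names SourcePreservingEdit and replaceTextById are made up for this illustration and are not part of jsoup or Jericho. It assumes the target element contains plain text only, and it is not a general HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jsoup or Jericho.
class SourcePreservingEdit {

    /**
     * Replaces the text of the first element whose id attribute equals id,
     * leaving every other character of html untouched.
     * Assumptions: the element holds plain text only (no nested tags),
     * and newText is already HTML-escaped by the caller.
     */
    static String replaceTextById(String html, String id, String newText) {
        // Group 1: an opening tag carrying id=value (quoted or unquoted).
        // Group 2: the text that follows, up to the next '<'.
        Pattern p = Pattern.compile(
                "(<[a-zA-Z][^>]*\\sid\\s*=\\s*[\"']?" + Pattern.quote(id)
                        + "[\"']?(?=[\\s>])[^>]*>)([^<]*)");
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return html; // no such element: return the input unchanged
        }
        // Splice by offset so nothing outside group 2 is rewritten.
        return html.substring(0, m.start(2)) + newText + html.substring(m.end(2));
    }

    public static void main(String[] args) {
        String html = "<div id=title>Old</div >\n"
                + "<p>1<p>2\n"
                + "<table><tr><td>1</td></tr></table>";
        System.out.println(replaceTextById(html, "title", "New"));
    }
}
```

For the sample input this changes only "Old" to "New"; the unquoted attribute, the unclosed p tags and the table markup come out exactly as they went in. The trade-off is that you lose jsoup's selectors and tree manipulation, so this only suits small, well-understood edits.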

answered Sep 24 '22 06:09 by ollo