How to avoid surrounding html head tags in Jsoup parse

Using Jsoup, I am trying to parse the given HTML content. After Jsoup.parse(), the HTML output has html, head and body tags added around the input. I just want to leave these out.

Sample Input:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Java code:

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParse {

    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/ab.html");
            String html = FileUtils.readFileToString(input, null);

            Document doc = Jsoup.parseBodyFragment(html);
            doc.outputSettings().prettyPrint(false);
            System.out.println(doc.html());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Actual output:

<html><head></head><body><p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>     </body></html>

Expected Output:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>

Please help.

asked Oct 03 '14 05:10

Roshan

1 Answer

The cause:

parseBodyFragment(), like all the other parse() methods, uses an HTML parser by default, and the HTML parser always adds the HTML shell (<html>…</html>, <head>…</head>, etc.).

The Solution:

Just don't use an HTML parser; use an XML parser instead ;-)

Document doc = Jsoup.parse(html, "", Parser.xmlParser());

Replace that single line and your problem is solved.
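
Applied to the program from the question, the change might look like the sketch below. It assumes jsoup is on the classpath and adds the import org.jsoup.parser.Parser, which the original program did not have. The class name FragmentXmlParse and the helper parseWithoutShell are hypothetical, and the input is passed in as a string rather than read from /ab.html so the sketch stays self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class FragmentXmlParse {

    // Hypothetical helper: parse with the XML parser so that no
    // <html>/<head>/<body> shell is wrapped around the input.
    static String parseWithoutShell(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        // Same output setting as in the question: no pretty-printing.
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(parseWithoutShell(html));
    }
}
```

Keep in mind that the XML parser does not apply HTML-specific parsing rules, so input that is not well-formed may come out differently than it would from the HTML parser.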

Example:

final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";

Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("*******  XML *******\n" + docXml);

Output:

******* HTML *******
<html>
 <head></head>
 <body>
  <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
 </body>
</html>

*******  XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
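
If the HTML parser has to stay (for example because the input is not well-formed XML), a possible alternative that is not part of the answer above is to keep parseBodyFragment() and print only the contents of the body element. A minimal sketch, assuming jsoup is on the classpath; the class name FragmentBodyHtml and the helper bodyOnly are hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentBodyHtml {

    // Hypothetical helper: the HTML parser still builds the full shell,
    // but only the inner HTML of <body> is returned.
    static String bodyOnly(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(bodyOnly(html));
    }
}
```

Here the html, head and body elements still exist in the parsed document; they are simply left out of the printed string.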

answered Nov 25 '22 09:11

ollo
