
How to avoid surrounding html head tags in Jsoup parse


Using Jsoup, I am trying to parse the given HTML content. After parsing, the output has <html>, <head> and <body> tags wrapped around the input. I just want to leave these out.

Sample Input:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p> 

Java code:

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParse {

    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/ab.html");
            String html = FileUtils.readFileToString(input, null);

            Document doc = Jsoup.parseBodyFragment(html);
            doc.outputSettings().prettyPrint(false);
            System.out.println(doc.html());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Actual output:

<html><head></head><body><p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>     </body></html> 

Expected Output:

<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p> 

Please help.

Roshan asked Oct 03 '14 05:10



1 Answer

The cause:

parseBodyFragment(), like the other parse() methods, uses the HTML parser by default. That parser always adds the HTML shell (<html>…</html>, <head>…</head>, <body>…</body>) around whatever it parses.

The Solution:

Don't use the HTML parser; use the XML parser instead ;-)

Document doc = Jsoup.parse(html, "", Parser.xmlParser()); 

Replace that single line (and import org.jsoup.parser.Parser) and the shell tags no longer appear in the output.

Example:

final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";

Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("*******  XML *******\n" + docXml);

Output:

******* HTML *******
<html>
 <head></head>
 <body>
  <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
 </body>
</html>

*******  XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
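
As a possible alternative, here is a minimal sketch that keeps the HTML parser and prints only what ends up inside <body>, next to the XML-parser approach from this answer. The class name FragmentOutput and the helper names viaBody and viaXmlParser are made up for illustration; the jsoup calls themselves (parseBodyFragment, body(), html(), Parser.xmlParser()) are the library's own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class FragmentOutput {

    // Option A (hypothetical helper): parse with the HTML parser,
    // then serialize only the children of <body>, skipping the shell.
    static String viaBody(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    // Option B (hypothetical helper): parse with the XML parser,
    // which does not create <html>, <head> or <body> elements.
    static String viaXmlParser(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(viaBody(html));
        System.out.println(viaXmlParser(html));
    }
}
```

Which option fits depends on the input: the XML parser does not apply HTML-specific parsing rules, so for content that relies on them, reading doc.body().html() from the HTML parser may be the safer choice.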
ollo answered Nov 25 '22 09:11