I am using Jsoup to parse the given HTML content. After Jsoup.parse(), the HTML output has html, head and body tags added around the input. I just want to leave these out.
Sample Input:
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
Java code:
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParse {
    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/ab.html");
            String html = FileUtils.readFileToString(input, null);
            Document doc = Jsoup.parseBodyFragment(html);
            doc.outputSettings().prettyPrint(false);
            System.out.println(doc.html());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Actual output:
<html><head></head><body><p><b>This <i>is</i></b> <i>my sentence</i> of text.</p> </body></html>
Expected Output:
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
Please help.
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation.
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
jsoup can also be used on XML, and it works well there. Its API is easy to use, so you can get the job done without writing a large amount of code.
An HTML element consists of a tag name, attributes, and child nodes (including text nodes and other elements). From an Element, you can extract data, traverse the node graph, and manipulate the HTML.
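As a rough illustration of extracting data from an Element, here is a minimal sketch that runs on the sample input from the question. The class name ElementBasics and the helper textsOf are made up for this example; the jsoup calls used (select, selectFirst, tagName, text, children, eachText) are part of its public API in reasonably recent versions.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementBasics {

    // Hypothetical helper: returns the text of every element matching
    // the CSS selector, in document order.
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        Element p = Jsoup.parse(html).selectFirst("p");

        System.out.println(p.tagName());          // tag name of the element
        System.out.println(p.text());             // text of the element and its descendants
        System.out.println(p.children().size());  // number of direct child elements
        System.out.println(textsOf(html, "i"));   // text of each <i> element
    }
}
```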
parseBodyFragment() as well as all other parse() methods use an HTML parser by default. And those always add the HTML shell (<html>…</html>, <head>…</head> etc.).
Just don't use an HTML parser; use an XML parser instead ;-)
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Replace that single line (and add the import for org.jsoup.parser.Parser) and your problem is solved.
final String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";

Document docHtml = Jsoup.parse(html);
Document docXml = Jsoup.parse(html, "", Parser.xmlParser());

System.out.println("******* HTML *******\n" + docHtml);
System.out.println();
System.out.println("******* XML *******\n" + docXml);
Output:
******* HTML *******
<html>
 <head></head>
 <body>
  <p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
 </body>
</html>

******* XML *******
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
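To tie this back to the question, here is a small self-contained sketch that applies the XML parser approach with pretty-printing switched off, as in the asker's code. It also shows a second option that is not part of the answer above: keeping the HTML parser and printing only the contents of the body element. The class name FragmentOutput and both method names are invented for this example. Note that the XML parser does not apply HTML-specific parsing rules, so input that relies on them (for example unclosed void tags) may come out differently from what the HTML parser would produce.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class FragmentOutput {

    // Parses with the XML parser, which does not wrap the input
    // in <html>, <head> and <body>.
    static String viaXmlParser(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    // Alternative (an assumption of this sketch, not from the answer):
    // keep the HTML parser, but output only what ended up inside <body>.
    static String viaBodyHtml(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(viaXmlParser(html));
        System.out.println(viaBodyHtml(html));
    }
}
```

For the sample input, both methods should give back the fragment without the surrounding shell. Which one fits better depends on whether the input is well-formed enough to be treated as XML.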