
Avoid removal of spaces and newline while parsing html using jsoup

I have the sample code below.

String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();

I get this output:

This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup

But I want this output:

This is a sample on              parsing html body using jsoup
This is a sample on              parsing html body using jsoup

How do I parse it so that I get this output? Or is there another way to do this in Java?

Aparna asked Nov 03 '16 08:11



1 Answer

You can disable pretty printing on the document to get the output you want. You also have to change .text() to .html(), because .text() normalizes whitespace: runs of spaces and newlines collapse into single spaces.

Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();
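
For reference, here is a minimal, self-contained sketch of the approach above, assuming jsoup is on the classpath. The class name PreserveWhitespace and the helper names bodyHtmlRaw and bodyTextRaw are made up for illustration. The second helper uses Element.wholeText(), which is not part of the answer above and is only available in reasonably recent jsoup releases, so treat it as an optional alternative when you want plain text rather than markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PreserveWhitespace {

    // Inner HTML of <body> with pretty printing switched off, so jsoup
    // neither re-indents the output nor collapses whitespace in text nodes.
    static String bodyHtmlRaw(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
        return doc.body().html();
    }

    // Un-normalized text of <body>. Assumption: the jsoup version in use
    // provides Element.wholeText(); older releases do not have it.
    static String bodyTextRaw(String html) {
        return Jsoup.parse(html).body().wholeText();
    }

    public static void main(String[] args) {
        String line = "This is a sample on              parsing html body using jsoup";
        String sample = "<html>\n<head>\n</head>\n<body>\n"
                + line + "\n" + line + "\n</body>\n</html>";

        System.out.println(bodyHtmlRaw(sample));
        System.out.println(bodyTextRaw(sample));
    }
}
```

Keep in mind that .html() returns markup, so any child tags stay in the string and characters such as & come back escaped as &amp;. Both results can also carry the leading and trailing newlines from the source, so call trim() if you do not want them.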
Benjamin P. answered Oct 20 '22 14:10
