
How do I preserve line breaks when using jsoup to convert html to plain text?

Tags: java, jsoup

I have the following code:

    import org.jsoup.Jsoup;

    public class NewClass {
        // Strips all tags and returns only the text content of the document.
        public String noTags(String str) {
            return Jsoup.parse(str).text();
        }

        public static void main(String[] args) {
            String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
                "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
            NewClass text = new NewClass();
            System.out.println(text.noTags(strings));
        }
    }

It produces this result:

hello world yo googlez 

But I want the line break to be kept:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

Billy asked Apr 12 '11 19:04




1 Answer

The real solution, one that preserves line breaks, looks like this:

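    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

    public static String br2nl(String html) {
        if (html == null)
            return html;
        Document document = Jsoup.parse(html);
        // Makes html() preserve line breaks and spacing
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        // Mark each <br> and <p> with a literal backslash-n placeholder
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        // Turn the placeholders into real newlines, then strip all remaining tags
        String s = document.html().replaceAll("\\\\n", "\n");
        return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
    }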

It satisfies the following requirements:

  1. If the original HTML contains a newline (\n), it is preserved.
  2. If the original HTML contains br or p tags, they get translated to newlines (\n).
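
The string escaping in br2nl is easy to misread, so here is a minimal sketch of just the placeholder step, using only the standard library (no jsoup). The class name PlaceholderDemo and the helper restoreNewlines are hypothetical names made up for this illustration; they are not part of jsoup or of the answer's code.

```java
public class PlaceholderDemo {
    // In Java source, "\\n" is a two-character string: a backslash followed by 'n'.
    // This is the marker that br2nl appends to <br> and prepends to <p>.
    public static final String PLACEHOLDER = "\\n";

    // Replaces every literal backslash-n marker with a real newline character.
    public static String restoreNewlines(String text) {
        // The regex \\n (written "\\\\n" in Java source) matches a literal
        // backslash followed by 'n'; the replacement "\n" is a real newline.
        return text.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "hello world" + PLACEHOLDER + "yo googlez";
        System.out.println(marked);                  // one line, marker still visible
        System.out.println(restoreNewlines(marked)); // two lines
    }
}
```

One consequence of this approach: if the original text itself contains a literal backslash followed by n, that sequence is also turned into a newline, because the replacement cannot tell it apart from an inserted marker.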
user121196 answered Sep 29 '22 16:09