
Convert breaks and paragraph breaks into new lines in Java

Basically, I have an HTML fragment with <br> and <p></p> inside. I was able to remove all the HTML tags, but doing so leaves the text badly formatted.

I want something like PHP's nl2br(), but in reverse (HTML in, newlines out), and it should also take <p> tags into account. Is there a library for this in Java?

user91954 asked Jun 28 '10 12:06


3 Answers

You basically need to replace each <br> with \n and each <p> with \n\n. So, at the points where you remove the tags, you need to insert \n and \n\n respectively.

Here's a kickoff example using the Jsoup HTML parser (the sample HTML is deliberately written so that handling it with regex is hard, if not nearly impossible).

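import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// The class name is arbitrary; it only makes the example compilable.
public class Br2NlExample {

    public static void main(String[] args) throws Exception {
        String originalHtml = "<p>p1l1<br/><!--</p>-->p1l2<br><!--<p>--></br><p id=p>p2l1<br class=b>p2l2</p>";
        String text = br2nl(originalHtml);
        String newHtml = nl2br(text);

        System.out.println(originalHtml);
        System.out.println("-------------");
        System.out.println(text);
        System.out.println("-------------");
        System.out.println(newHtml);
    }

    // Turns <br> and <p> into newlines and strips all remaining markup.
    public static String br2nl(String html) {
        Document document = Jsoup.parse(html);
        // Insert the literal two-character marker \n (backslash + n) as a placeholder,
        // because text() normalizes real whitespace.
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        // Swap the placeholders for real newline characters.
        return document.text().replace("\\n", "\n");
    }

    // Reverse direction: double newlines become <p>, single newlines become <br>.
    public static String nl2br(String text) {
        return text.replace("\n\n", "<p>").replace("\n", "<br>");
    }
}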

(Note: replaceAll() is unnecessary here, since we just want a plain CharSequence-for-CharSequence replacement, not a regex-pattern-for-CharSequence replacement.)
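
To illustrate the note above: replace() takes both arguments literally, while replaceAll() treats the first argument as a regex and gives backslashes and dollar signs a special meaning in the second. A minimal sketch of the difference (the class and method names are made up for illustration):

```java
// Hypothetical demo class; the names are illustrative only.
class ReplaceVsReplaceAll {

    // Literal replacement: every "<br>" becomes a real newline character.
    static String literal(String s) {
        return s.replace("<br>", "\n");
    }

    // Regex replacement whose replacement string is backslash + n (two characters).
    // In a replaceAll() replacement a backslash escapes the next character,
    // so this inserts the letter 'n', not a newline.
    static String regexEscaped(String s) {
        return s.replaceAll("<br>", "\\n");
    }

    public static void main(String[] args) {
        System.out.println(literal("a<br>b"));      // prints "a" and "b" on separate lines
        System.out.println(regexEscaped("a<br>b")); // prints "anb"
    }
}
```

So when the search string is a fixed tag, replace() is both simpler and less surprising.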

Output:

<p>p1l1<br/><!--</p>-->p1l2<br><!--<p>--></br><p id=p>p2l1<br class=b>p2l2</p>
-------------


p1l1 
p1l2 



p2l1 
p2l2
-------------
<p>p1l1 <br>p1l2 <br> <br> <p>p2l1 <br>p2l2

A bit hacky, but it works.

BalusC answered Oct 28 '22 15:10


br2nl and p2nl are not too complicated. Give this a try:

String plain = htmlText.replace("<br>", "\n").replace("<p>", "\n\n").replace("</p>", "");
Andreas Dolk answered Oct 28 '22 14:10


You should be able to use replaceAll. See http://www.rgagnon.com/javadetails/java-0454.html for an example. You need just two of those calls, one for p and one for br. The example goes the other way, but you can turn it around to replace the HTML tags with \n.
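
The replaceAll approach described above could look roughly like the sketch below. The class name, method name, and the exact patterns are assumptions, not taken from the linked page. It tolerates upper-case tags and attributes, but unlike an HTML parser it will also rewrite tags that appear inside comments.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names and patterns are assumptions for illustration.
class RegexBr2Nl {

    // Matches <br>, <br/>, and <br class=b>, case-insensitively.
    private static final Pattern BR = Pattern.compile("(?i)<br\\b[^>]*>");
    // Matches an opening <p> tag with optional attributes (but not <pre>).
    private static final Pattern P_OPEN = Pattern.compile("(?i)<p\\b[^>]*>");
    // Matches a closing </p> tag.
    private static final Pattern P_CLOSE = Pattern.compile("(?i)</p\\s*>");

    static String br2nl(String html) {
        // The replacement strings are real newline characters, which need no escaping.
        String s = BR.matcher(html).replaceAll("\n");
        s = P_OPEN.matcher(s).replaceAll("\n\n");
        return P_CLOSE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(br2nl("<p>line one<br/>line two</p><p id=\"x\">next</p>"));
    }
}
```

Any other tags are left in place, so a separate tag-stripping step would still be needed afterwards.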

Joelio answered Oct 28 '22 14:10