How to convert HTML to text keeping linebreaks

Tags: java, html

How can I convert HTML to text while keeping the line breaks produced by elements such as br, p, and div, possibly using NekoHTML or any other decent HTML parser?

Example:

Hello<br/>World

should become:

Hello\n
World
asked Mar 25 '10 07:03 by Eduardo



2 Answers

Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.

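import java.io.IOException;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Both methods are meant to live inside a class of your choice.

public static String htmlToText(InputStream html) throws IOException {
    // A null charset lets Jsoup detect the encoding; "" is the base URI.
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();

    return buildStringFromNode(body).toString();
}

private static StringBuilder buildStringFromNode(Node node) {
    // StringBuilder instead of StringBuffer: no synchronization is needed here.
    StringBuilder buffer = new StringBuilder();

    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        // Note: trim() also drops the space next to inline tags,
        // so "a <b>b</b>" comes out as "ab".
        buffer.append(textNode.text().trim());
    }

    // Visit the children in document order.
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }

    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        // Emit a line break after each <p> and <br>; add other
        // block-level tags (for example "div") here if needed.
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }

    return buffer;
}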
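
For trying out the same recursive walk without adding Jsoup to the classpath, here is a rough sketch that uses the DOM parser shipped with the JDK instead. It is not the code from this answer: the class name XhtmlToText, the method names, and the synthetic <root> wrapper are made up for illustration, "div" is added to the tag check, and it assumes the input is well-formed XHTML (every tag closed, no HTML-only entities such as &nbsp;). Real-world HTML usually breaks that assumption, which is why a lenient parser such as Jsoup is the better fit in practice.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: the node walk from the answer, on the JDK's XML DOM.
public class XhtmlToText {

    /** Converts a well-formed XHTML fragment to text, keeping line breaks. */
    public static String xhtmlToText(String xhtml) throws Exception {
        // Assumption: wrap the fragment in a synthetic root element so that
        // input like "Hello<br/>World" is a single well-formed XML document.
        String wrapped = "<root>" + xhtml + "</root>";
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)))
                .getDocumentElement();

        StringBuilder out = new StringBuilder();
        appendNode(root, out);
        return out.toString();
    }

    private static void appendNode(Node node, StringBuilder out) {
        // Text nodes contribute their trimmed content.
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue().trim());
        }

        // Recurse into the children in document order.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            appendNode(children.item(i), out);
        }

        // After a line-breaking element has been fully visited, emit "\n".
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = node.getNodeName().toLowerCase(Locale.ROOT);
            if (tag.equals("p") || tag.equals("br") || tag.equals("div")) {
                out.append('\n');
            }
        }
    }
}
```

Calling XhtmlToText.xhtmlToText("Hello<br/>World") gives "Hello", a newline, then "World", which matches the example in the question. As in the Jsoup version, trimming each text node also removes the space next to inline tags.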
answered Oct 14 '22 23:10 by jasop


w3m -dump -no-cookie input.html > output.txt
answered Oct 14 '22 21:10 by weakish