How can I convert HTML to text while keeping the line breaks produced by elements such as br, p, and div, possibly using NekoHTML or any other decent HTML parser?
Example: Hello<br/>World
to:
Hello\n
World
The <pre> tag defines preformatted text. Text in a <pre> element is displayed in a fixed-width font, and the text preserves both spaces and line breaks.
Preserve newlines, line breaks, and whitespace in HTML: if you want your text to overflow the parent's boundaries, use pre as the value of the CSS white-space property. Using white-space: pre-wrap still preserves newlines and spaces, but wraps the text at the parent's edge.
The easiest way would be to strip all the HTML tags using the replace() method of JavaScript. It finds all tags enclosed in angle brackets and replaces them with a space. var text = html.
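The same tag-stripping idea can be sketched in Java using only the standard library. This is a rough illustration, not a robust converter: the class name TagStripper and the method stripTags are made up for this example, and the choice to turn br, closing p, and closing div tags into newlines before removing the remaining tags is an assumption added to address the line-break requirement in the question. Unlike the JavaScript description above, remaining tags are replaced with nothing rather than a space. Regular expressions do not decode entities such as &amp;amp; and can misfire on comments, scripts, or attribute values containing ">", so a real parser is usually the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
public class TagStripper {
    // <br>, <br/>, <br /> and closing </p> or </div> are treated as line breaks.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("(?i)<br\\s*/?>|</(p|div)\\s*>");

    // Anything else enclosed in angle brackets is treated as a tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    public static String stripTags(String html) {
        // First turn line-breaking tags into newlines, then drop the remaining tags.
        String withNewlines = LINE_BREAKS.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withNewlines).replaceAll("");
    }
}
```

Calling TagStripper.stripTags("Hello<br/>World") yields Hello, a newline, then World, which matches the example in the question.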
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
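import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// The class name is illustrative; the original snippet showed only the two methods.
public class HtmlToText {
    public static String htmlToText(InputStream html) throws IOException {
        // A null charset lets Jsoup read it from a meta tag or fall back to UTF-8.
        Document document = Jsoup.parse(html, null, "");
        Element body = document.body();
        return buildStringFromNode(body).toString();
    }

    // Depth-first walk: text nodes add their text, <p> and <br> add a newline.
    private static StringBuilder buildStringFromNode(Node node) {
        StringBuilder buffer = new StringBuilder();
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            // Note: trim() also drops the whitespace separating adjacent inline nodes.
            buffer.append(textNode.text().trim());
        }
        for (Node childNode : node.childNodes()) {
            buffer.append(buildStringFromNode(childNode));
        }
        if (node instanceof Element) {
            Element element = (Element) node;
            String tagName = element.tagName();
            // The newline is appended after the element's children have been visited.
            if ("p".equals(tagName) || "br".equals(tagName)) {
                buffer.append("\n");
            }
        }
        return buffer;
    }
}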
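Alternatively, from the command line, the w3m text browser can dump the rendered page as plain text:
w3m -dump -no-cookie input.html > output.txt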