I have an HTML file:
<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;">
<div>Test message.</div>
<div> </div>
<div>More content here...</div>
<div> </div>
<div>Best regards,</div>
<div>Mr. Crowley</div></div></body></html>
I am trying to extract the text content of the file above using Apache Tika:
final InputStream input = new FileInputStream("file.html");
final ContentHandler handler = new BodyContentHandler();
final Metadata metadata = new Metadata();
final HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
This works, except that the output contains extra line breaks:
Test message.
More content here...
Best regards,
Mr. Crowley
<and 3 empty lines here>
Is it possible to avoid this behavior and get the expected result below instead?
Test message.
More content here...
Best regards,
Mr. Crowley
Post-processing the string with something like
plainText = plainText.replaceAll("(\n)+", "\n");
is unfortunately not an option for me. I also can't change the structure of my HTML file.
One solution is to implement a custom ContentHandler that does not write those extra newlines (newlines that are present in the original document are still kept):
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class OriginalBodyContentHandler extends BodyContentHandler {
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // Intentionally empty: do not write the extra newlines
        // generated by XHTMLContentHandler.
    }
}
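To see why overriding ignorableWhitespace has this effect, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. The class names (IgnorableWhitespaceDemo, TextHandler, QuietTextHandler) and the render helper are hypothetical and exist only for illustration. The helper stands in for a parser, under the assumption that generated line breaks arrive as ignorableWhitespace events while text and newlines from the source document arrive as characters events.

```java
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceDemo {

    /** Collects text and also copies whitespace reported as "ignorable". */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Same as TextHandler, but drops every ignorableWhitespace event. */
    static class QuietTextHandler extends TextHandler {
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Intentionally empty: generated whitespace is not written.
        }
    }

    /**
     * Hypothetical stand-in for a parser (an assumption, not Tika code).
     * For each block it sends the text via characters(), a generated line
     * break via ignorableWhitespace(), and a line break from the source
     * document via characters().
     */
    static String render(TextHandler handler, String... blocks) {
        char[] nl = {'\n'};
        for (String block : blocks) {
            char[] text = block.toCharArray();
            handler.characters(text, 0, text.length);
            handler.ignorableWhitespace(nl, 0, nl.length); // generated
            handler.characters(nl, 0, nl.length);          // from the source
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String noisy = render(new TextHandler(), "Test message.", "Best regards,");
        String quiet = render(new QuietTextHandler(), "Test message.", "Best regards,");
        // Show line breaks as a visible "\n" so the difference is easy to see.
        System.out.println(noisy.replace("\n", "\\n"));
        System.out.println(quiet.replace("\n", "\\n"));
    }
}
```

In this sketch the quiet handler keeps one line break per block, because those come through characters(), and loses only the generated ones. Note that whitespace-only elements such as <div> </div> would still contribute their own text, so the result with a real document may not be completely free of blank lines.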