I have a database full of small HTML documents and I need to programmatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason: honouring <b> tags is a must; CSS like <span style="blah"> is a nice-to-have).
Both iText and Aspose work (roughly) along these lines:
Document document = new Document( Size.A4, Aspect.PORTRAIT );
document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );
Therefore (I think) I need some kind of HTML parser which I can inspect for strings and styles to insert into my document.
Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.
HTMLparser is a good HTML parser.
I have used this to parse HTML on one of my projects.
You can write your own filters to parse the HTML for what you want, so the <br> tag shouldn't be difficult to parse out.
You can parse out CSS using the CssSelectorNodeFilter.
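To give an idea of the general approach (walk the HTML, remember which style is active, and hand each piece of text to the document with that style), here is a minimal sketch. Note that it does not use HTMLparser: to keep it dependency-free it uses the callback parser bundled with the JDK (javax.swing.text.html), so treat it as an illustration of the idea rather than of that library's API. The names HtmlRuns and Run are made up for this example, and only <b> and <strong> are tracked; other tags and inline CSS would need extra handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML fragment into text runs with a bold flag.
class HtmlRuns {

    /** One piece of text plus the style that was active when it was read. */
    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    static List<Run> parse(String html) {
        final List<Run> runs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Counts open <b>/<strong> elements so nested bold tags are handled.
            private int boldDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    boldDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.B || tag == HTML.Tag.STRONG) && boldDepth > 0) {
                    boldDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Record the text together with the style active at this point.
                runs.add(new Run(new String(data), boldDepth > 0));
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declaration in the HTML.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }
}
```

With the runs in hand, the loop on the output side would follow the pseudo-API from the question: for each run, call something like document.setBold( run.bold ) and then document.insert( run.text ). The same pattern should carry over to HTMLparser or any other parser that lets you visit tags and text in document order.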