
How do I programmatically inspect an HTML document?

Tags: java, html, parsing

I have a database full of small HTML documents and I need to programmatically insert several into, say, a PDF document with iText or a Word document with Aspose.Words. I need to preserve any formatting within the HTML documents (within reason, honouring <b> tags is a must, CSS like <span style="blah"> is a nice-to-have).

Both iText and Aspose work (roughly) along these lines:

Document document = new Document( Size.A4, Aspect.PORTRAIT );

document.setFont( "Helvetica", 20, Font.BOLD );
document.insert( "some string" );
document.setBold( true );
document.insert( "A bold string" );

Therefore (I think) I need some kind of HTML parser which I can inspect for strings and styles to insert into my document.

Can anybody suggest a good library or a sensible approach to this problem? The platform is Java.

banjollity asked Oct 20 '08 13:10


1 Answer

HTMLparser is a good HTML parser.

I have used this to parse HTML on one of my projects.

You can write your own filters to pick out the parts of the HTML you want, so the <br> tag shouldn't be difficult to parse out.

You can parse out CSS using the CssSelectorNodeFilter.
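
To make the general idea concrete, here is a minimal sketch of walking an HTML fragment and collecting text runs together with their bold state. It does not use HTMLParser: so that it runs without extra dependencies, it swaps in the parser that ships with the JDK (javax.swing.text.html). The names StyledRunExtractor and Run are hypothetical, made up for this illustration, and only <b> and <strong> are tracked. In a real exporter, each run would presumably be turned into the setBold / insert calls from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML fragment into text runs and records
// whether each run sits inside a <b> or <strong> element.
public class StyledRunExtractor {

    /** One piece of text plus the bold state in effect when it was read. */
    public static final class Run {
        public final String text;
        public final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    public static List<Run> extract(String html) {
        final List<Run> runs = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Depth counter so nested bold elements are handled correctly.
            private int boldDepth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) {
                    boldDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if ((tag == HTML.Tag.B || tag == HTML.Tag.STRONG) && boldDepth > 0) {
                    boldDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // This is where document.setBold(...) and document.insert(...)
                // from the question would be called instead.
                runs.add(new Run(new String(data), boldDepth > 0));
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    public static void main(String[] args) {
        for (Run run : extract("<p>plain <b>bold</b> tail</p>")) {
            System.out.println((run.bold ? "[bold]  " : "[plain] ") + run.text);
        }
    }
}
```

The same shape should carry over to HTMLParser's filters or visitors: keep a small piece of state per style you care about, update it on start and end tags, and emit the text with whatever state is current. Handling inline CSS such as <span style="..."> would mean reading the style attribute in handleStartTag, which this sketch leaves out.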

Craig Angus answered Oct 09 '22 21:10