A user enters text as HTML in a form, for example:
<p>this is my <strong>blog</strong> post,
very <i>long</i> and written in <b>HTML</b></p>
I want to be able to output only part of the string (for example, the first 20 characters) without breaking the HTML structure of the user's input. In this case:
<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>
which renders as
this is my <strong>blog</strong> post, very <i>l</i>...
Is there a Java library able to do this, or a simple method to use?
MyLibrary.abbreviateHTML(string,20) ?
Since this is not easy to do correctly, I usually strip all tags and then truncate. This gives precise control over the size and appearance of the text, which usually ends up in places where you do need that control.
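A minimal sketch of the strip-then-truncate approach, assuming a crude regular expression is good enough to remove tags (it is not a real HTML parser and leaves entities such as &amp; untouched). The class and method names are made up for illustration:

```java
// Hypothetical helper, not part of any library.
final class PlainTextTeaser {

    /** Removes everything tag-shaped, collapses whitespace, then cuts at maxChars. */
    static String stripAndTruncate(String html, int maxChars) {
        String text = html
                .replaceAll("<[^>]*>", "")  // assumption: no '>' inside attribute values
                .replaceAll("\\s+", " ")    // line breaks and runs of spaces become one space
                .trim();
        if (text.length() <= maxChars) {
            return text;                    // short enough, nothing to cut
        }
        return text.substring(0, maxChars) + "...";
    }
}
```

For example, stripAndTruncate("<b>hi</b>", 20) gives back "hi", and anything longer than the limit is cut and suffixed with "...". The result is plain text, so it still has to be escaped before it is written into a page.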
Note that you may find my proposal very conservative, and it is not actually a proper answer to your question. But most of the time the alternatives are:
The reason truncating HTML is hard is that you don't know how the cut will affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?
So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs etc.). A good and safe implementation would therefore strip out everything apart from inline "styling" tags (bold, italics etc.) and truncate while keeping track of unclosed tags.
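The stripping half of that idea could look roughly like this. It is only a sketch: the whitelist (b, i, strong, em) is my own assumption about what counts as "inline styling", the class name is made up, and a regular expression is no substitute for a real HTML parser or sanitizer:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class InlineTagFilter {

    // Matches any opening, closing or self-closing tag whose name is NOT
    // one of b, i, strong, em (the negative lookahead skips those names).
    private static final Pattern NON_INLINE_TAG = Pattern.compile(
            "</?(?!(?:b|i|strong|em)\\b)[a-z][^>]*>", Pattern.CASE_INSENSITIVE);

    /** Drops structural tags but keeps their text and the whitelisted inline tags. */
    static String keepInlineTags(String html) {
        return NON_INLINE_TAG.matcher(html).replaceAll("");
    }
}
```

Note that attributes on the kept tags survive unchanged, so this alone does not make untrusted input safe to display.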
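I don't know of any library, but it should not be that complicated (for the 80% case). You only need a simple "parser" that understands 4 types of tokens:
opening tag - everything that starts with < but not </ and ends with > but not />
closing tag - everything that starts with </ and ends with >
self-closing tag (like <br/>) - everything that starts with < but not </ and ends with />
normal character - everything that is not part of a tag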
Then you walk through your input string and count the "normal characters". While you walk along the string and count, you copy every token to the output as long as the number of normal characters counted is less than or equal to the amount you want to keep.
You also need to build a stack of the currently open tags while you walk through the input. Every time you pass an opening tag you push its name onto the stack; every time you find a closing tag you pop the topmost tag name from the stack (hopefully the input is correct XHTML).
When you reach the required number of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack.
But be careful: this only works if the input is well-formed XML.
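The steps above can be sketched as follows. This is only an illustration under the same assumption (well-formed XHTML, no '>' inside attribute values, no comments or CDATA). The class and method names are hypothetical, entities such as &amp; are counted as several characters, and appending "..." inside the innermost open tag is my own choice:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of any library.
final class HtmlAbbreviator {

    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // names of currently open tags
        int count = 0;                            // normal characters copied so far
        boolean truncated = false;
        int i = 0;

        while (i < html.length()) {
            boolean atLimit = count >= maxChars;
            char c = html.charAt(i);
            int end = (c == '<') ? html.indexOf('>', i) : -1;

            if (end < 0) {                        // normal character
                if (atLimit) { truncated = true; break; }
                out.append(c);
                count++;
                i++;
                continue;
            }

            String tag = html.substring(i, end + 1);
            if (tag.startsWith("</")) {           // closing tag: still copied at the limit
                if (!open.isEmpty()) open.pop();
            } else {
                if (atLimit) { truncated = true; break; }
                if (!tag.endsWith("/>")) {        // opening tag: remember its name
                    open.push(tagName(tag));
                }                                 // self-closing tag: nothing to track
            }
            out.append(tag);
            i = end + 1;
        }

        if (truncated) out.append("...");
        while (!open.isEmpty()) {                 // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    /** Extracts "p" from "<p>" or "<p class=...>". */
    private static String tagName(String tag) {
        int j = 1;
        while (j < tag.length() && tag.charAt(j) != '>'
                && !Character.isWhitespace(tag.charAt(j))) {
            j++;
        }
        return tag.substring(1, j);
    }
}
```

Calling abbreviate on the question's input (with the line break replaced by a space) and a limit of 28 produces <p>this is my <strong>blog</strong> post, very <i>l...</i></p>. Input that fits within the limit comes back unchanged.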
I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.