I need to truncate an HTML string that was already sanitized by my app before being stored in the DB; it contains only links, images and formatting tags. When it is presented to users, it needs to be truncated to give an overview of the content.
So I need to abbreviate HTML strings in Java such that
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
when truncated does not return something like this
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/><a href="htt
but instead returns
<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />
<br/>
Your requirements are a bit vague, even after reading all the comments. Given your example and explanations, I assume your requirements are the following:
I'll give you two solutions, a simple one which may not be correct, depending on what the input looks like exactly, and a more complex one which is correct.
For the first solution, we first find the last '>' character before the truncate size (this corresponds to the last tag which was completely closed). After this character may come text which does not belong to any tag, so we then search for the first '<' character after the last closed tag. In code:
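public static String truncate1(String input, int size)
{
    if (input.length() < size) return input;
    // Last '>' inside the truncated part, i.e. at an index below size.
    int pos = input.lastIndexOf('>', size - 1);
    // First '<' after that tag; if pos is -1 the search starts at index 0.
    int pos2 = input.indexOf('<', pos);
    if (pos2 < 0 || pos2 >= size) {
        // No tag is opened before the cut, so cutting at size is safe.
        return input.substring(0, size);
    }
    else {
        // A tag is opened but not closed before the cut: drop it.
        return input.substring(0, pos2);
    }
}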
Of course this solution does not consider quoted attribute values: the '<' and '>' characters might occur inside a quoted string, in which case they should be ignored. I mention the solution anyway because you say your input is sanitized, so possibly you can ensure that the quoted strings never contain '<' and '>' characters.
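To make that caveat concrete, here is a minimal sketch of an input that trips the simple approach. The class name and the sample input are hypothetical; the method is a copy of the first approach (with the backward search for '>' starting at size - 1, so that only characters before the cut are considered).

```java
class Truncate1Pitfall {
    // Copy of the simple approach; it knows nothing about quoted strings.
    public static String truncate1(String input, int size) {
        if (input.length() < size) return input;
        int pos = input.lastIndexOf('>', size - 1);
        int pos2 = input.indexOf('<', pos);
        if (pos2 < 0 || pos2 >= size) {
            return input.substring(0, size);
        }
        return input.substring(0, pos2);
    }

    public static void main(String[] args) {
        // Hypothetical input: the alt text contains a '>' character.
        String html = "<br/><img alt=\"a > b\" src=\"x.jpg\" />";
        // The quoted '>' is mistaken for the end of a tag, so the
        // result is cut in the middle of the <img> tag.
        System.out.println(truncate1(html, 25));
        // prints: <br/><img alt="a > b" src
    }
}
```

If the sanitizer escapes '<' and '>' inside attribute values (for example as &lt; and &gt;), this case cannot occur and the simple approach is enough.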
To consider the quoted strings, we cannot rely on standard Java classes anymore, but we have to scan the input ourselves and remember if we are currently inside a tag and inside a string or not. If we encounter a '<' character outside of a string, we remember its position, so that when we reach the truncate point we know the position of the last opened tag. If that tag wasn't closed, we truncate before the beginning of that tag. In code:
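public static String truncate2(String input, int size)
{
    if (input.length() < size) return input;
    int lastTagStart = 0;      // position of the most recently opened tag
    boolean inString = false;  // inside a double-quoted attribute value
    boolean inTag = false;     // between '<' and the matching '>'
    for (int pos = 0; pos < size; pos++) {
        switch (input.charAt(pos)) {
            case '<':
                if (!inString && !inTag) {
                    lastTagStart = pos;
                    inTag = true;
                }
                break;
            case '>':
                if (!inString) inTag = false;
                break;
            case '\"':
                // Only double quotes are tracked; quotes in plain text are ignored.
                if (inTag) inString = !inString;
                break;
        }
    }
    // If the last tag was closed before the cut, keep everything up to size.
    if (!inTag) lastTagStart = size;
    return input.substring(0, lastTagStart);
}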
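As a quick check of the scanner, here is a small self-contained sketch that runs it against the example from the question and against an input with a '>' inside a quoted value. The class name and the second input are hypothetical; the method body is a copy of truncate2 above.

```java
class Truncate2Demo {
    // Copy of the quote-aware scanner.
    public static String truncate2(String input, int size) {
        if (input.length() < size) return input;
        int lastTagStart = 0;
        boolean inString = false;
        boolean inTag = false;
        for (int pos = 0; pos < size; pos++) {
            switch (input.charAt(pos)) {
                case '<':
                    if (!inString && !inTag) {
                        lastTagStart = pos;
                        inTag = true;
                    }
                    break;
                case '>':
                    if (!inString) inTag = false;
                    break;
                case '"':
                    if (inTag) inString = !inString;
                    break;
            }
        }
        if (!inTag) lastTagStart = size;
        return input.substring(0, lastTagStart);
    }

    public static void main(String[] args) {
        String kept = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/>";
        String full = kept + "<a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
        // Cut 12 characters into the <a> tag, i.e. right after href="htt
        String result = truncate2(full, kept.length() + 12);
        System.out.println(result.equals(kept)); // prints: true

        // Hypothetical input with '>' inside a quoted attribute value.
        System.out.println(truncate2("<br/><img alt=\"a > b\" src=\"x.jpg\" />", 25));
        // prints: <br/>
    }
}
```

Note that the scanner only avoids cutting inside a tag. It does not close elements that are still open at the cut, so a truncated `<a href="...">some text` would keep its opening tag without a matching `</a>`.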
A robust way of doing it is to use the hotsax code, which parses HTML and lets you interface with the parser through the traditional low-level SAX XML API. (Note that it is not an XML parser: it parses poorly formed HTML, and it only chooses to let you interface with it using a standard XML API.)
Here on GitHub I have created a quick-and-dirty working example project whose main class parses your truncated example string:
XMLReader parser = XMLReaderFactory.createXMLReader("hotsax.html.sax.SaxParser");
final StringBuilder builder = new StringBuilder();
ContentHandler handler = new DoNothingContentHandler(){
    StringBuilder wholeTag = new StringBuilder();
    boolean hasText = false;
    boolean hasElements = false;
    String lastStart = "";

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String text = (new String(ch, start, length)).trim();
        wholeTag.append(text);
        hasText = true;
    }

    @Override
    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        if( !hasText && !hasElements && lastStart.equals(localName)) {
            builder.append("<"+localName+"/>");
        } else {
            wholeTag.append("</"+ localName +">");
            builder.append(wholeTag.toString());
        }
        wholeTag = new StringBuilder();
        hasText = false;
        hasElements = false;
    }

    @Override
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        wholeTag.append("<"+ localName);
        for( int i = 0; i < atts.getLength(); i++) {
            wholeTag.append(" "+atts.getQName(i)+"='"+atts.getValue(i)+"'");
            hasElements = true;
        }
        wholeTag.append(">");
        lastStart = localName;
        hasText = false;
    }
};
parser.setContentHandler(handler);
//parser.parse(new InputSource( new StringReader( "<div>this is the <em>end</em> my <br> friend <a href=\"whatever\">some link</a>" ) ));
parser.parse(new InputSource( new StringReader( "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/><a href=\"htt" ) ));
System.out.println( builder.toString() );
It outputs:
<img src='http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg'></img><br/>
It adds an </img> tag, but that's harmless for HTML, and it would be possible to tweak the code so that the output exactly matches the input if you felt that was necessary.
Hotsax is actually generated code, produced by running yacc/flex compiler tools over the HtmlParser.y and StyleLexer.flex files, which define the low-level grammar of HTML. So you benefit from the work of the person who created that grammar; all you need to do is write some fairly trivial code and test cases to reassemble the parsed fragments as shown above. That's much better than trying to write your own regular expressions or, worse, a hand-coded string scanner to interpret the string, as that is very fragile.