
Java Library to truncate html strings?

I need to truncate an HTML string that was already sanitized by my app before being stored in the DB and that contains only links, images and formatting tags. When it is presented to users, it needs to be truncated to give an overview of the content.

So I need to abbreviate HTML strings in Java such that

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />   
<br/><a href="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />

when truncated does not return something like this

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />   
<br/><a href="htt

but instead returns

<img src="http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg" />   
<br/>

Rajat Gupta asked Nov 10 '22 18:11


2 Answers

Your requirements are a bit vague, even after reading all the comments. Given your example and explanations, I assume your requirements are the following:

  • The input is a string consisting of (x)html tags. Your example doesn't contain this, but I assume the input can contain text between the tags.
  • In the context of your problem, we do not care about nesting. So the input is really only text intermingled with tags, where opening, closing and self-closing tags are all considered equivalent.
  • Tags can contain quoted values.
  • You want to truncate your string such that the string is not truncated in the middle of a tag. So in the truncated string every '<' character must have a corresponding '>' character.

I'll give you two solutions, a simple one which may not be correct, depending on what the input looks like exactly, and a more complex one which is correct.

First solution

For the first solution, we first find the last '>' character before the truncate size (this corresponds to the last tag which was completely closed). After this character may come text which does not belong to any tag, so we then search for the first '<' character after the last closed tag. In code:

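public static String truncate1(String input, int size)
{
    // Nothing to cut if the input already fits.
    if (input.length() <= size) return input;

    // Last '>' inside the kept part; start at size - 1 because
    // lastIndexOf(ch, fromIndex) also inspects fromIndex itself.
    int pos = input.lastIndexOf('>', size - 1);
    // First '<' after that tag end (a negative pos is treated as 0).
    int pos2 = input.indexOf('<', pos);

    if (pos2 < 0 || pos2 >= size) {
        // No tag is open at the cut, so cut at the requested size.
        return input.substring(0, size);
    }
    else {
        // A tag starts before the cut and is not closed: drop it.
        return input.substring(0, pos2);
    }
}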

Of course this solution does not consider the quoted value strings: the '<' and '>' characters might occur inside a string, in which case they should be ignored. I mention the solution anyway because you say your input is sanitized, so possibly you can ensure that the quoted strings never contain '<' and '>' characters.
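
To make that limitation concrete, here is a small self-contained sketch. The class name Truncate1Demo and both sample strings are made up for illustration; the method follows the first solution, with the backward search starting at size - 1. The first call backs off to the start of the unfinished <a> tag, while the second is fooled by the '>' inside the quoted alt value and cuts in the middle of the tag.

```java
class Truncate1Demo {
    // The first solution: cut at `size`, but back off to the start of a tag
    // that is opened after the last '>' seen before the cut.
    static String truncate1(String input, int size) {
        if (input.length() <= size) return input;
        int pos = input.lastIndexOf('>', size - 1);
        int pos2 = input.indexOf('<', pos);
        if (pos2 < 0 || pos2 >= size) {
            return input.substring(0, size);
        }
        return input.substring(0, pos2);
    }

    public static void main(String[] args) {
        // The cut at 22 falls inside the <a ...> tag, so that tag is dropped.
        String plain = "<b>bold</b> and <a href=\"x.html\">link</a>";
        System.out.println("[" + truncate1(plain, 22) + "]");
        // prints [<b>bold</b> and ]

        // The '>' inside the quoted alt value is mistaken for the end of a tag,
        // so the result is cut in the middle of the <img> tag.
        String tricky = "<img alt=\"a > b\" src=\"pic.jpg\"/> text";
        System.out.println("[" + truncate1(tricky, 20) + "]");
        // prints [<img alt="a > b" src]
    }
}
```

If the sanitizer guarantees that attribute values never contain '<' or '>', the second case cannot occur and the simple version is enough.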

Second solution

To take the quoted strings into account, simple index searches are no longer enough; we have to scan the input ourselves and remember whether we are currently inside a tag and inside a string or not. If we encounter a '<' character outside of a string, we remember its position, so that when we reach the truncate point we know the position of the last opened tag. If that tag wasn't closed, we truncate before the beginning of that tag. In code:

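public static String truncate2(String input, int size)
{
    // Nothing to cut if the input already fits.
    if (input.length() <= size) return input;

    int lastTagStart = 0;      // position of the most recent '<' that opened a tag
    boolean inString = false;  // inside a double-quoted attribute value
    boolean inTag = false;     // between a '<' and its matching '>'

    for (int pos = 0; pos < size; pos++) {
        switch (input.charAt(pos)) {
            case '<':
                if (!inString && !inTag) {
                    lastTagStart = pos;
                    inTag = true;
                }
                break;
            case '>':
                // A '>' inside a quoted value does not close the tag.
                if (!inString) inTag = false;
                break;
            case '"':
                // Only double quotes are tracked; single-quoted values are not.
                if (inTag) inString = !inString;
                break;
        }
    }
    // No tag open at the cut: keep everything up to size.
    if (!inTag) lastTagStart = size;
    return input.substring(0, lastTagStart);
}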
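
As a quick check, the scanning approach can be run against the example from the question. The sketch below is a variant of the second solution, not a drop-in copy: it also remembers which quote character opened a value, so single-quoted attributes are handled too. That extension is an assumption about what the input may contain, and the names Truncate2Demo and truncate are made up for illustration.

```java
class Truncate2Demo {
    // Scan the first `size` characters, tracking whether we are inside a tag
    // and inside a quoted attribute value (double or single quotes).
    static String truncate(String input, int size) {
        if (input.length() <= size) return input;

        int lastTagStart = 0;   // position of the most recent tag-opening '<'
        boolean inTag = false;  // between a '<' and its matching '>'
        char quote = 0;         // quote character of the open value, 0 if none

        for (int pos = 0; pos < size; pos++) {
            char c = input.charAt(pos);
            if (quote != 0) {
                // Inside a quoted value only the matching quote matters.
                if (c == quote) quote = 0;
            } else if (inTag) {
                if (c == '"' || c == '\'') quote = c;
                else if (c == '>') inTag = false;
            } else if (c == '<') {
                inTag = true;
                lastTagStart = pos;
            }
        }
        // If a tag is still open at the cut, drop it entirely.
        return input.substring(0, inTag ? lastTagStart : size);
    }

    public static void main(String[] args) {
        String html = "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n"
                + "<br/><a href=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />";
        // Cut right after href="htt, as in the question.
        int cut = html.indexOf("href") + 9;
        // Keeps everything before the unfinished <a ...> tag.
        System.out.println(truncate(html, cut));

        // A '>' inside a single-quoted value does not end the tag,
        // so the whole unfinished tag is dropped.
        System.out.println("[" + truncate("<a title='1 > 0' href='x'>y</a>", 14) + "]");
        // prints []
    }
}
```

With only double quotes tracked, the second call would instead return a string that ends inside the title attribute.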

Hoopje answered Nov 15 '22 13:11


A robust way of doing it is to use the hotsax code, which parses HTML and lets you interface with the parser using the traditional low-level SAX XML API. (Note that it is not an XML parser: it parses poorly formed HTML and only chooses to let you interface with it using a standard XML API.)

On GitHub I have created a working quick-and-dirty example project, which has a main class that parses your truncated example string:

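    // Imports needed by this excerpt: java.io.StringReader, org.xml.sax.* (XMLReader,
    // ContentHandler, Attributes, InputSource, SAXException) and
    // org.xml.sax.helpers.XMLReaderFactory.
    XMLReader parser = XMLReaderFactory.createXMLReader("hotsax.html.sax.SaxParser");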

    final StringBuilder builder = new StringBuilder();

    ContentHandler handler = new DoNothingContentHandler(){

        StringBuilder wholeTag = new StringBuilder();
        boolean hasText = false;
        boolean hasElements = false;
        String lastStart = "";

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            String text = (new String(ch, start, length)).trim();
            wholeTag.append(text);
            hasText = true;
        }

        @Override
        public void endElement(String namespaceURI, String localName,
                String qName) throws SAXException {
            if( !hasText && !hasElements && lastStart.equals(localName)) {
                builder.append("<"+localName+"/>");
            } else {
                wholeTag.append("</"+ localName +">");
                builder.append(wholeTag.toString());
            }

            wholeTag = new StringBuilder();
            hasText = false;
            hasElements = false;
        }

        @Override
        public void startElement(String namespaceURI, String localName,
                String qName, Attributes atts) throws SAXException {
            wholeTag.append("<"+ localName);
            for( int i = 0; i < atts.getLength(); i++) {
                wholeTag.append(" "+atts.getQName(i)+"='"+atts.getValue(i)+"'");
                hasElements = true;
            }
            wholeTag.append(">");
            lastStart = localName;
            hasText = false;
        }

    };
    parser.setContentHandler(handler);

    //parser.parse(new InputSource( new StringReader( "<div>this is the <em>end</em> my <br> friend <a href=\"whatever\">some link</a>" ) ));
    parser.parse(new InputSource( new StringReader( "<img src=\"http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg\" />\n<br/><a href=\"htt" ) ));

    System.out.println( builder.toString() );

It outputs:

<img src='http://d2qxdzx5iw7vis.cloudfront.net/34775606.jpg'></img><br/>

It adds an </img> tag, but that's harmless for HTML, and it would be possible to tweak the code so that the output exactly matches the input if you felt that necessary.

Hotsax is actually code generated by running the yacc/flex compiler tools over the HtmlParser.y and StyleLexer.flex files, which define the low-level grammar of HTML. So you benefit from the work of the person who created that grammar; all you need to do is write some fairly trivial code and test cases to reassemble the parsed fragments as shown above. That's much better than trying to write your own regular expressions or, worse, a hand-coded string scanner to interpret the string, as that is very fragile.

simbo1905 answered Nov 15 '22 11:11