Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trim string to length ignoring HTML

This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.

For example, here is the full text we are displaying, including HTML


In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.

We want to trim the news item to 250 characters, but exclude HTML.

The method we are using for trimming currently includes the HTML, and this results in some news posts that are HTML heavy getting truncated considerably.

For instance, if the above example included tons of HTML, it could potentially look like this:

In an attempt to make a bit more space in the office, kitchen, I've pulled...

This is not what we want.

Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?

like image 802
steve_c Avatar asked Apr 09 '09 22:04

steve_c


2 Answers

Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.

Take note that this will have another problem that you'll have to deal with when an HTML tag is opened but not closed before the cutoff.

like image 173
Chad Birch Avatar answered Nov 15 '22 19:11

Chad Birch


Following the 2-state finite machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:

http://pastebin.com/jCRqiwNH

and here a test case:

http://pastebin.com/37gCS4tV

And here the Java code:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

public class HtmlShortener {

    private static final String TAGS_TO_SKIP = "br,hr,img,link";
    private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
    private static final int STATUS_READY = 0;

        private int cutPoint = -1;
    private String htmlString = "";

    final List<String> tags = new LinkedList<String>();

    StringBuilder sb = new StringBuilder("");
    StringBuilder tagSb = new StringBuilder("");

    int charCount = 0;
    int status = STATUS_READY;

    public HtmlShortener(String htmlString, int cutPoint){
        this.cutPoint = cutPoint;
        this.htmlString = htmlString;
    }

    public String cut(){

        // reset 
        tags.clear();
        sb = new StringBuilder("");
        tagSb = new StringBuilder("");
        charCount = 0;
        status = STATUS_READY;

        String tag = "";

        if (cutPoint < 0){
            return htmlString;
        }

        if (null != htmlString){

            if (cutPoint == 0){
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++){

                String strC = htmlString.substring(i, i+1);


                if (strC.equals("<")){

                    // new tag or tag closure

                    // previous tag reset
                    tagSb = new StringBuilder("");
                    tag = "";

                    // find tag type and name
                    for (int k = i; k < htmlString.length(); k++){

                        String tagC = htmlString.substring(k, k+1);
                        tagSb.append(tagC);

                        if (tagC.equals(">")){
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")){

                                // closure
                                if (!isToSkip(tag)){
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove((tags.size() - 1));
                                }

                            } else {

                                // new tag
                                sb.append(tagSb.toString());

                                if (!isToSkip(tag)){
                                    tags.add(tag);  
                                }

                            }

                            i = k;
                            break;
                        }

                    }

                } else {

                    sb.append(strC);
                    charCount++;

                }

                // cut check
                if (charCount >= cutPoint){

                    // close previously open tags
                    Collections.reverse(tags);
                    for (String t : tags){
                        sb.append("</").append(t).append(">");
                    }
                    break;
                } 

            }

            return sb.toString();

        } else {
            return null;
        }

    }

    private boolean isToSkip(String tag) {

        if (tag.startsWith("/")){
            tag = tag.substring(1, tag.length());
        }

        for (String tagToSkip : tagsToSkip){
            if (tagToSkip.equals(tag)){
                return true;
            }
        }

        return false;
    }

    private String getTag(String tagString) {

        if (tagString.contains(" ")){
            // tag with attributes
            return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(" "));
        } else {
            // simple tag
            return tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
        }


    }

}

like image 41
Matteo Pelucco Avatar answered Nov 15 '22 20:11

Matteo Pelucco