This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.
For example, here is the full text we are displaying, including HTML:
In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.
We want to trim the news item to 250 characters, not counting the HTML markup.
The method we currently use for trimming counts the HTML toward the limit, so HTML-heavy news posts end up truncated far more than intended.
For instance, if the above example included tons of HTML, it could potentially look like this:
In an attempt to make a bit more space in the office, kitchen, I've pulled...
This is not what we want.
Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?
Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.
Take note that this will have another problem that you'll have to deal with when an HTML tag is opened but not closed before the cutoff.
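As a rough illustration of the counting idea above, here is a minimal sketch in Java. The class name `VisibleCutIndex` and the method `cutIndex` are made up for this example. It assumes that every `<` starts a tag and the next `>` ends it, counts an entity such as `&amp;` as several characters, and does not deal with the unclosed-tag problem just mentioned.

```java
final class VisibleCutIndex {

    /**
     * Returns the index (exclusive) at which to cut html so that at most
     * limit characters outside of tags are kept.
     * Hypothetical helper: assumes '<' always opens a tag and '>' closes it.
     */
    static int cutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        boolean inTag = false; // the two states: inside a tag, or in visible text
        int visible = 0;       // characters counted so far, tags excluded
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false; // tag finished, resume counting
                }
            } else if (c == '<') {
                inTag = true;      // tag started, stop counting
            } else {
                visible++;
                if (visible == limit) {
                    return i + 1;  // cut right after this character
                }
            }
        }
        return html.length();      // shorter than the limit, nothing to cut
    }

    public static void main(String[] args) {
        String html = "<p>In an <b>attempt</b> to make space</p>";
        int end = cutIndex(html, 10);
        // prints "<p>In an <b>atte..."
        System.out.println(html.substring(0, end) + "...");
    }
}
```

Note that the output in this example still has `<p>` and `<b>` open, which is exactly the problem described above: the tags that are open at the cut position have to be closed afterwards.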
Following the two-state finite-state machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:
http://pastebin.com/jCRqiwNH
And here is a test case:
http://pastebin.com/37gCS4tV
And here is the Java code:
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;

    public class HtmlShortener {

        // Tags that never get a closing tag, so they are not tracked as open.
        private static final String TAGS_TO_SKIP = "br,hr,img,link";
        private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");

        private int cutPoint = -1;
        private String htmlString = "";

        // Names of the tags that are currently open, innermost last.
        final List<String> tags = new LinkedList<String>();
        StringBuilder sb = new StringBuilder("");
        StringBuilder tagSb = new StringBuilder("");
        int charCount = 0;

        public HtmlShortener(String htmlString, int cutPoint) {
            this.cutPoint = cutPoint;
            this.htmlString = htmlString;
        }

        public String cut() {
            // Reset state so that cut() can be called more than once.
            tags.clear();
            sb = new StringBuilder("");
            tagSb = new StringBuilder("");
            charCount = 0;
            String tag = "";

            if (null == htmlString) {
                return null;
            }
            if (cutPoint < 0) {
                // A negative cut point means "do not shorten".
                return htmlString;
            }
            if (cutPoint == 0) {
                return "";
            }

            for (int i = 0; i < htmlString.length(); i++) {
                String strC = htmlString.substring(i, i + 1);
                if (strC.equals("<")) {
                    // Start of an opening or closing tag: read it up to '>'
                    // without counting its characters.
                    tagSb = new StringBuilder("");
                    tag = "";
                    for (int k = i; k < htmlString.length(); k++) {
                        String tagC = htmlString.substring(k, k + 1);
                        tagSb.append(tagC);
                        if (tagC.equals(">")) {
                            tag = getTag(tagSb.toString());
                            if (tag.startsWith("/")) {
                                // Closing tag: close the innermost open tag.
                                // The isEmpty() check guards against a stray
                                // closing tag with nothing left to close.
                                if (!isToSkip(tag) && !tags.isEmpty()) {
                                    sb.append("</").append(tags.get(tags.size() - 1)).append(">");
                                    tags.remove(tags.size() - 1);
                                }
                            } else {
                                // Opening tag: copy it verbatim and remember it.
                                sb.append(tagSb.toString());
                                if (!isToSkip(tag)) {
                                    tags.add(tag);
                                }
                            }
                            i = k;
                            break;
                        }
                    }
                } else {
                    // Visible character: copy it and count it.
                    sb.append(strC);
                    charCount++;
                }

                // Cut check
                if (charCount >= cutPoint) {
                    // Close the tags that are still open, innermost first.
                    Collections.reverse(tags);
                    for (String t : tags) {
                        sb.append("</").append(t).append(">");
                    }
                    break;
                }
            }
            return sb.toString();
        }

        // Returns true for tags listed in TAGS_TO_SKIP (with or without a leading '/').
        private boolean isToSkip(String tag) {
            if (tag.startsWith("/")) {
                tag = tag.substring(1, tag.length());
            }
            for (String tagToSkip : tagsToSkip) {
                // HTML tag names are case-insensitive.
                if (tagToSkip.equalsIgnoreCase(tag)) {
                    return true;
                }
            }
            return false;
        }

        // Extracts the tag name from a complete tag, e.g. "a" from "<a href=...>"
        // and "/a" from "</a>".
        private String getTag(String tagString) {
            String name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">"));
            if (name.contains(" ")) {
                // Tag with attributes: keep only the name.
                name = name.substring(0, name.indexOf(" "));
            }
            if (name.length() > 1 && name.endsWith("/")) {
                // Self-closing form such as "<br/>".
                name = name.substring(0, name.length() - 1);
            }
            return name;
        }
    }