
HTML Truncating in Python

Is there a pure-Python tool to take some HTML and truncate it as close to a given length as possible, but make sure the resulting snippet is well-formed? For example, given this HTML:

<h1>This is a header</h1>
<p>This is a paragraph</p>

it would not produce:

<h1>This is a hea

but:

<h1>This is a header</h1>

or at least:

<h1>This is a hea</h1>

I can't find one that works, though I found one that relies on pullparser, which is both obsolete and dead.

JasonFruit asked Feb 11 '11 14:02


2 Answers

I don't think you need a full-fledged parser - you only need to tokenize the input string into a stream of tokens, each of which is one of:

  • text
  • open tag
  • close tag
  • self-closing tag
  • character entity

Once you have a stream of tokens like that, it's easy to use a stack to keep track of what tags need closing. I actually ran into this problem a while ago and wrote a small library to do this:

https://github.com/eentzel/htmltruncate.py

It's worked well for me and handles most corner cases, including arbitrarily nested markup, counting a character entity as a single character, returning an error on malformed markup, etc.

It will produce:

<h1>This is a hea</h1>

on your example. This could perhaps be changed, but it's hard in the general case - what if you're trying to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?
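For illustration, the tokenize-and-stack idea described above can be sketched in a few lines. This is only a rough sketch and not the code of htmltruncate.py: the names TOKEN_RE, VOID_TAGS and truncate_html are made up for this example, the regex is a simplification that ignores comments, CDATA and script content, and the choice to stop before opening a new tag once the limit is reached is my own assumption.

```python
import re

# Hypothetical names throughout; this is not the API of htmltruncate.py.
# Tags that never take a close tag, so they are never pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "col", "area", "base", "param"}

# One alternative per token kind; anything else falls through to a single text character.
TOKEN_RE = re.compile(
    r"(?P<close></(?P<close_name>[A-Za-z][A-Za-z0-9]*)\s*>)"
    r"|(?P<open><(?P<open_name>[A-Za-z][A-Za-z0-9]*)(?:\s[^<>]*?)?(?P<selfclose>/)?>)"
    r"|(?P<entity>&(?:#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)"
    r"|(?P<char>.)",
    re.DOTALL,
)


def truncate_html(html, limit):
    """Cut html to at most `limit` visible characters and close any open tags."""
    out = []
    open_tags = []  # stack of tag names still waiting for their close tag
    visible = 0     # text characters and entities emitted so far
    for m in TOKEN_RE.finditer(html):
        if m.group("close"):
            name = m.group("close_name").lower()
            if not open_tags or open_tags[-1] != name:
                raise ValueError("unexpected close tag: " + m.group(0))
            open_tags.pop()
            out.append(m.group(0))
        elif m.group("open"):
            if visible >= limit:
                # Assumption: don't open new elements once the budget is spent.
                break
            name = m.group("open_name").lower()
            if not m.group("selfclose") and name not in VOID_TAGS:
                open_tags.append(name)
            out.append(m.group(0))
        else:
            # A text character or a character entity; each counts as one.
            if visible >= limit:
                break
            visible += 1
            out.append(m.group(0))
    # Close whatever is still open, innermost first.
    out.extend("</%s>" % name for name in reversed(open_tags))
    return "".join(out)


sample = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_html(sample, 13))
print(truncate_html(sample, 16))
```

On the example from the question, a limit of 13 gives <h1>This is a hea</h1> and a limit of 16 gives <h1>This is a header</h1>. Markup whose close tags don't match the stack raises ValueError instead of being silently repaired.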

eentzel answered Oct 08 '22 09:10


If you're using Django, you can simply use its text and HTML utilities:

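from django.utils import text, html


class class_name():

    def trim_string(self, stringf, limit, offset=0):
        # Plain slice; not HTML-aware, so it can cut through a tag
        return stringf[offset:limit]

    def trim_html_words(self, html, limit, offset=0):
        # Truncates to `limit` words and closes open tags; offset is unused.
        # As far as I know, newer Django releases dropped this function in
        # favour of text.Truncator(html).words(limit, html=True).
        return text.truncate_html_words(html, limit)

    def remove_html(self, htmls, tag, limit='all', offset=0):
        # Strips every tag; tag, limit and offset are unused
        return html.strip_tags(htmls)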

Anyway, here's the source of truncate_html_words from Django:

import re

def truncate_html_words(s, num):
    """
    Truncates html to a certain number of words (not counting tags and comments).
    Closes opened tags if they were correctly closed in the given html.
    """
    length = int(num)
    if length <= 0:
        return ''
    html4_singlets = ('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input')
    # Set up regular expressions
    re_words = re.compile(r'&.*?;|<.*?>|([A-Za-z0-9][\w-]*)')
    re_tag = re.compile(r'<(/)?([^ ]+?)(?: (/)| .*?)?>')
    # Count non-HTML words and keep note of open tags
    pos = 0
    ellipsis_pos = 0
    words = 0
    open_tags = []
    while words <= length:
        m = re_words.search(s, pos)
        if not m:
            # Checked through whole string
            break
        pos = m.end(0)
        if m.group(1):
            # It's an actual non-HTML word
            words += 1
            if words == length:
                ellipsis_pos = pos
            continue
        # Check for tag
        tag = re_tag.match(m.group(0))
        if not tag or ellipsis_pos:
            # Don't worry about non tags or tags after our truncate point
            continue
        closing_tag, tagname, self_closing = tag.groups()
        tagname = tagname.lower()  # Element names are always case-insensitive
        if self_closing or tagname in html4_singlets:
            pass
        elif closing_tag:
            # Check for match in open tags list
            try:
                i = open_tags.index(tagname)
            except ValueError:
                pass
            else:
                # SGML: An end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
                open_tags = open_tags[i+1:]
        else:
            # Add it to the start of the open tags list
            open_tags.insert(0, tagname)
    if words <= length:
        # Don't try to close tags if we don't need to truncate
        return s
    out = s[:ellipsis_pos] + ' ...'
    # Close any tags still open
    for tag in open_tags:
        out += '</%s>' % tag
    # Return string
    return out

vertazzar answered Oct 08 '22 09:10