Truncate Markdown?

I have a Rails site, where the content is written in markdown. I wish to display a snippet of each, with a "Read more.." link.

How do I go about this? Simply truncating the raw text will not work, for example:

>> "This is an [example](http://example.com)"[0..25]
=> "This is an [example](http:"

Ideally I want to allow the author to (optionally) insert a marker to specify what to use as the "snippet". If there is no marker, it would take the first 250 words and append "...". For example:

This article is an example of something or other.

This segment will be used as the snippet on the index page.

^^^^^^^^^^^^^^^

This text will be visible once clicking the "Read more.." link

The marker can be thought of as an EOF marker for the snippet (and can be ignored when displaying the full document).
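A rough sketch of the marker idea, working on the raw Markdown before any parser sees it. Everything here is an assumption for illustration: the helper names `markdown_snippet` and `markdown_without_marker` are hypothetical (not part of maruku or Rails), and the marker is taken to be a line consisting only of three or more carets. When there is no marker, the fallback keeps whole paragraphs within the word budget rather than cutting at exactly 250 words, so inline syntax such as links is never split.

```ruby
# Hypothetical helpers, not part of maruku or Rails.
# Assumed marker syntax: a line made up only of three or more carets.
EXCERPT_MARKER = /^\^{3,}[ \t]*$/

# Returns [snippet_markdown, truncated].
def markdown_snippet(source, word_limit = 250)
  # 1. An explicit marker wins: everything before it is the snippet.
  if (m = EXCERPT_MARKER.match(source))
    return [m.pre_match.rstrip, true]
  end

  # 2. Otherwise keep whole paragraphs while the word budget allows.
  #    The first paragraph is always kept, even if it exceeds the limit,
  #    so the snippet is never empty.
  paragraphs = source.split(/\n{2,}/)
  kept = []
  words = 0
  paragraphs.each do |para|
    count = para.split.size
    break if !kept.empty? && words + count > word_limit
    kept << para
    words += count
  end
  [kept.join("\n\n"), kept.size < paragraphs.size]
end

# For the full article view: the same source with the marker line removed.
def markdown_without_marker(source)
  source.sub(/^\^{3,}[ \t]*\n?/, '')
end
```

The snippet returned is still Markdown, so it can be passed to whichever parser is in use, and the `truncated` flag tells the view whether to show "..." and the "Read more.." link.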

I am using maruku for the Markdown processing (RedCloth is very biased towards Textile, BlueCloth is extremely buggy, and I wanted a native-Ruby parser, which ruled out peg-markdown and RDiscount).

Alternatively (since the Markdown is translated to HTML anyway) truncating the HTML correctly would be an option - although it would be preferable to not markdown() the entire document, just to get the first few lines.

So, the options I can think of are (in order of preference):

  • Add a "truncate" option to the maruku parser, which will only parse the first x words, or till the "excerpt" marker.
  • Write/find a parser-agnostic Markdown truncate'r
  • Write/find an intelligent HTML truncating function
dbr asked Dec 28 '08 03:12


2 Answers

  • Write/find an intelligent HTML truncating function

The following code, taken from http://mikeburnscoder.wordpress.com/2006/11/11/truncating-html-in-ruby/ with some modifications, will correctly truncate HTML and makes it easy to append a string before the closing tags.

>> puts "<p><b><a href=\"hi\">Something</a></p>".truncate_html(5, "...")
=> <p><b><a href="hi">Someth...</a></b></p>

The modified code:

require 'rexml/parsers/pullparser'

class String
  def truncate_html(len = 30, at_end = nil)
    p = REXML::Parsers::PullParser.new(self)
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last}#{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
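        # Note: 0..new_len is an inclusive range, so up to new_len + 1
        # characters of this text node are kept ("Someth" for a length of 5).
        results << p_e[0][0..new_len]
        new_len -= p_e[0].length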
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    # Append the caller's string (e.g. "...") rather than a hard-coded one.
    results << at_end if at_end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      ' ' + attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
dbr answered Sep 17 '22 08:09


Here's a solution that works for me with Textile.

  1. Convert it to HTML.
  2. Truncate it.
  3. Remove any HTML tag that got cut in half:

    html_string = html_string.gsub(/<[^>]*$/, "")

  4. Then use Hpricot to clean it up and close any unclosed tags:

    html_string = Hpricot( html_string ).to_s

I do this in a helper, and with caching there's no performance issue.
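For illustration, the steps above can be sketched end to end in plain Ruby. This is only a sketch under stated assumptions: the helper names are hypothetical, and `close_open_tags` is a naive regex-based stand-in for the Hpricot cleanup in step 4, not the Hpricot API. It also anchors the dangling-tag pattern with `\z` instead of `$`, so that only the very end of the string is examined when the HTML spans several lines.

```ruby
# Hypothetical helpers; pure-Ruby stand-ins, not the Hpricot API.
VOID_TAGS = %w[br hr img input meta link].freeze

# Steps 2 and 3: cut at a character count, then drop a tag that was
# left half-open at the very end of the string.
def truncate_html_text(html, limit)
  html[0, limit].sub(/<[^>]*\z/, "")
end

# Step 4 stand-in: track open tags on a stack and append the missing
# closing tags. A real HTML parser handles far more edge cases.
def close_open_tags(html)
  open_tags = []
  html.scan(%r{<(/)?([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/)?>}) do |closing, name, self_closing|
    name = name.downcase
    next if self_closing || VOID_TAGS.include?(name)
    if closing
      # Pop back to the most recent matching open tag, if there is one.
      idx = open_tags.rindex(name)
      open_tags.slice!(idx..-1) if idx
    else
      open_tags.push(name)
    end
  end
  html + open_tags.reverse.map { |tag| "</#{tag}>" }.join
end

def html_snippet(html, limit)
  close_open_tags(truncate_html_text(html, limit))
end

puts html_snippet("<p>Hello <a href=\"x\">world</a></p>", 15)
# prints: <p>Hello </p>
```

Note that the limit here counts markup characters as well as visible text, so a snippet with many tags or long attributes will show fewer words than the number suggests.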

nicholaides answered Sep 20 '22 08:09