
Generate a table of contents from HTML with Python

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.

My plan so far was to:

  • Extract a list of headers using BeautifulSoup

  • Use a regex on the content to place anchor links before/inside the header tags (so the user can click through from the table of contents). There might be a way to do this replacement inside BeautifulSoup itself?

  • Output a nested list of links to the headers in a predefined spot.

It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.

Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?

An example:

<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>
Oli asked Mar 25 '10 11:03


2 Answers

A quickly hacked, ugly piece of code:

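from bs4 import BeautifulSoup

# `html` holds the HTML fragment to process
soup = BeautifulSoup(html, "html.parser")

toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.find_all(['h2', 'h3']):
    # Attribute values must be strings
    header['id'] = str(header_id)

    if previous_tag == 'h2' and header.name == 'h3':
        # Start a sub-list for the h3 entries under this h2
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        # Close the sub-list and go back to the top level
        toc.append(current_list)
        current_list = toc

    # get_text() also works when the header contains nested tags,
    # where header.string would be None
    current_list.append((header_id, header.get_text()))

    header_id += 1
    previous_tag = header.name

# Identity check: flush a sub-list that is still open at the end
if current_list is not toc:
    toc.append(current_list)


def list_to_html(lst):
    # Note: a nested <ul> is emitted as a direct child of the outer <ul>;
    # browsers render it, but strict HTML expects it inside an <li>.
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#%s">%s</a></li>' % item)
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)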
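
If the nesting of the generated list matters, here is a variant of the code above wrapped in a function. It is only a sketch: it assumes the bs4 package, the function name build_toc and the "toc-N" id scheme are made up for illustration, and it escapes the header text and puts each sub-list inside the <li> of the preceding h2.

```python
from html import escape

from bs4 import BeautifulSoup


def build_toc(html):
    """Return (toc_html, modified_html) for an HTML fragment.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    parts = ["<ul>"]
    open_item = False   # an <li> for an h2 is still open
    in_sublist = False  # a nested <ul> of h3 entries is still open

    for number, header in enumerate(soup.find_all(["h2", "h3"]), start=1):
        anchor = "toc-%d" % number  # assumed id scheme
        header["id"] = anchor
        link = '<a href="#%s">%s</a>' % (anchor, escape(header.get_text()))

        if header.name == "h3" and open_item:
            # h3 under an h2: goes into the nested list
            if not in_sublist:
                parts.append("<ul>")
                in_sublist = True
            parts.append("<li>%s</li>" % link)
            continue

        # Top-level entry: close whatever is still open first
        if in_sublist:
            parts.append("</ul>")
            in_sublist = False
        if open_item:
            parts.append("</li>")
            open_item = False

        if header.name == "h2":
            parts.append("<li>%s" % link)  # left open for possible h3 entries
            open_item = True
        else:
            # h3 with no preceding h2: treated as a top-level entry
            parts.append("<li>%s</li>" % link)

    if in_sublist:
        parts.append("</ul>")
    if open_item:
        parts.append("</li>")
    parts.append("</ul>")
    return "\n".join(parts), str(soup)


sample = (
    "<p>This is an introduction</p>"
    "<h2>This is a sub-header</h2><p>...</p>"
    "<h3>This is a sub-sub-header</h3><p>...</p>"
    "<h2>This is a sub-header</h2><p>...</p>"
)
toc_html, body_html = build_toc(sample)
print(toc_html)
print(body_html)
```

The ids are prefixed rather than bare numbers so they are easier to target from CSS selectors; any unique string would do.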
Łukasz answered Oct 14 '22 04:10


Use lxml.html.

  • It can deal with invalid html just fine.
  • It is very fast.
  • It allows you to easily create the missing elements and move elements around between the trees.
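
To make this concrete, here is a rough sketch of the same idea with lxml.html. It assumes lxml is installed; the function name toc_with_lxml and the "toc-N" id scheme are invented for the example, and the fragment is wrapped in a <div> because it has no single root element.

```python
import lxml.html
from lxml.html import builder as E


def toc_with_lxml(html):
    """Return (toc_html, modified_html); hypothetical helper for illustration."""
    # create_parent wraps the loose fragment in a <div> so it has one root
    root = lxml.html.fragment_fromstring(html, create_parent="div")

    toc = E.UL()
    last_item = None  # <li> of the most recent h2
    sublist = None    # nested <ul> under that <li>, created on demand

    # iter() walks matching elements in document order
    for number, header in enumerate(root.iter("h2", "h3"), start=1):
        anchor = "toc-%d" % number  # assumed id scheme
        header.set("id", anchor)
        link = E.A(header.text_content(), href="#" + anchor)

        if header.tag == "h3" and last_item is not None:
            if sublist is None:
                sublist = E.UL()
                last_item.append(sublist)
            sublist.append(E.LI(link))
        else:
            item = E.LI(link)
            toc.append(item)
            if header.tag == "h2":
                last_item, sublist = item, None

    return (
        lxml.html.tostring(toc, encoding="unicode"),
        lxml.html.tostring(root, encoding="unicode"),
    )


sample = (
    "<p>This is an introduction</p>"
    "<h2>This is a sub-header</h2><p>...</p>"
    "<h3>This is a sub-sub-header</h3><p>...</p>"
    "<h2>This is a sub-header</h2><p>...</p>"
)
toc_html, body_html = toc_with_lxml(sample)
print(toc_html)
print(body_html)
```

Because the table of contents is built as an element tree, it could also be placed into the document with something like root.insert(0, toc) instead of being serialized separately.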
nosklo answered Oct 14 '22 04:10