Generate a table of contents from HTML with Python

Question

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.

My plan so far was to:

1. Extract a list of headers using beautifulsoup.
2. Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents). There might be a method for replacing inside beautifulsoup?
3. Output a nested list of links to the headers in a predefined spot.

It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.

Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?

An example:

<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>

Łukasz · Accepted Answer

Some quickly hacked ugly piece of code:

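from bs4 import BeautifulSoup

# html holds the HTML fragment to process
soup = BeautifulSoup(html, 'html.parser')

toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.find_all(['h2', 'h3']):
    # Give each header an id so the table of contents can link to it
    header['id'] = str(header_id)

    if previous_tag == 'h2' and header.name == 'h3':
        # Start a nested list for the sub-headers
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        # Close the nested list and return to the top level
        toc.append(current_list)
        current_list = toc

    # get_text() also works when the header contains nested tags,
    # where header.string would be None
    current_list.append((header_id, header.get_text()))

    header_id += 1
    previous_tag = header.name

# Attach a nested list left open by a trailing h3
if current_list is not toc:
    toc.append(current_list)


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#%s">%s</a></li>' % item)
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)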
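
One caveat with list_to_html: the nested <ul> is emitted as a direct child of the outer <ul>, whereas valid HTML expects a sub-list to sit inside its parent <li>. Below is a minimal sketch of an alternative renderer, assuming the headers have been collected as flat (level, id, title) tuples. The name render_toc and that tuple layout are hypothetical and not part of the code above; it uses only the standard library and escapes the titles.

```python
from html import escape


def render_toc(headers):
    """Render flat (level, id, title) tuples as nested <ul> markup.

    Levels are heading levels (2 for h2, 3 for h3); each sub-list is
    placed inside the <li> of the header it belongs to.
    """
    parts = []
    stack = []  # heading levels of the lists currently open
    for level, anchor, title in headers:
        link = '<a href="#%s">%s</a>' % (escape(str(anchor)), escape(title))
        if not stack or level > stack[-1]:
            # Deeper level: open a new list inside the current <li>
            parts.append("<ul>")
            stack.append(level)
        else:
            # Same or shallower level: close the previous item,
            # then close any deeper lists that are still open
            parts.append("</li>")
            while len(stack) > 1 and stack[-1] > level:
                stack.pop()
                parts.append("</ul></li>")
            stack[-1] = level
        parts.append("<li>" + link)
    # Close whatever is still open
    for _ in stack:
        parts.append("</li></ul>")
    return "".join(parts)


toc_html = render_toc([
    (2, 1, "This is a sub-header"),
    (3, 2, "This is a sub-sub-header"),
    (2, 3, "This is a sub-header"),
])
print(toc_html)
```

The stack tracks which lists are open, so the same function would handle deeper levels (h4 and so on) if they were collected too.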

nosklo · Answer

Use lxml.html.

It can deal with invalid html just fine.
It is very fast.
It allows you to easily create the missing elements and move elements around between the trees.
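
A minimal sketch of what this could look like, assuming lxml is installed; the "section-N" id scheme and the variable names are made up for illustration and are not from the answer. It parses the fragment from the question, gives each h2/h3 an id, and collects the entries needed for a table of contents.

```python
import lxml.html

html = """<p>This is an introduction</p>
<h2>This is a sub-header</h2>
<p>...</p>
<h3>This is a sub-sub-header</h3>
<p>...</p>
<h2>This is a sub-header</h2>
<p>...</p>"""

# Wrap the fragment in a <div> so it has a single root element
root = lxml.html.fragment_fromstring(html, create_parent="div")

entries = []
# iter() walks the tree in document order, keeping only h2 and h3
for number, header in enumerate(root.iter("h2", "h3"), start=1):
    anchor = "section-%d" % number  # hypothetical id scheme
    header.set("id", anchor)
    entries.append((header.tag, anchor, header.text_content()))

# Serialize the modified tree back to a string
modified_html = lxml.html.tostring(root, encoding="unicode")

for tag, anchor, title in entries:
    print(tag, anchor, title)
```

The entries list holds (tag, id, title) tuples in document order, which is enough to build the nested list of links; the output is wrapped in the extra <div> that was added as the root.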

Tags: python, html, beautifulsoup, tableofcontents

Oli
