I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2>
and <h3>
tags.
My plan so far was to:
Extract a list of headers using beautifulsoup
Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- There might be a method for replacing inside beautifulsoup
?
Output a nested list of links to the headers in a predefined spot.
It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.
Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?
A example:
<p>This is an introduction</p>
<h2>This is a sub-header</h2>
<p>...</p>
<h3>This is a sub-sub-header</h3>
<p>...</p>
<h2>This is a sub-header</h2>
<p>...</p>
Some quickly hacked ugly piece of code:
soup = BeautifulSoup(html)
toc = []
header_id = 1
current_list = toc
previous_tag = None
for header in soup.findAll(['h2', 'h3']):
header['id'] = header_id
if previous_tag == 'h2' and header.name == 'h3':
current_list = []
elif previous_tag == 'h3' and header.name == 'h2':
toc.append(current_list)
current_list = toc
current_list.append((header_id, header.string))
header_id += 1
previous_tag = header.name
if current_list != toc:
toc.append(current_list)
def list_to_html(lst):
result = ["<ul>"]
for item in lst:
if isinstance(item, list):
result.append(list_to_html(item))
else:
result.append('<li><a href="#%s">%s</a></li>' % item)
result.append("</ul>")
return "\n".join(result)
# Table of contents
print list_to_html(toc)
# Modified HTML
print soup
Use lxml.html
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With