Generate HTML Page with Specific Tags from Another Page using BeautifulSoup

Question

I'm exploring BeautifulSoup and aiming to retain only specific tags in an HTML file to create a new one.

I can successfully achieve this with the following program. However, I believe there might be a more suitable and natural approach without the need to manually append the strings.

from bs4 import BeautifulSoup
#soup = BeautifulSoup(page.content, 'html.parser')

with open('P:/Test.html', 'r') as f:
    contents = f.read()
    soup= BeautifulSoup(contents, 'html.parser')

NewHTML = "<html><body>"
NewHTML+="\n"+str(soup.find('title'))
NewHTML+="\n"+str(soup.find('p', attrs={'class': 'm-b-0'}))
NewHTML+="\n"+str(soup.find('div', attrs={'id' :'right-col'}))
NewHTML+= "</body></html>"

with open("output1.html", "w") as file:
    file.write(NewHTML)

Andreas Violaris · Accepted Answer

You can have a list of desired tags, iterate through them, and use Beautiful Soup's append method to selectively include corresponding elements in the new HTML structure.

from bs4 import BeautifulSoup

with open('Test.html', 'r') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'html.parser')

new_html = BeautifulSoup("<html><body></body></html>", 'html.parser')

tags_to_keep = ['title', {'p': {'class': 'm-b-0'}}, {'div': {'id': 'right-col'}}]

# Iterate through the tags to keep and append them to the new HTML
for tag in tags_to_keep:
    # If the tag is a string, find it in the original HTML
    # and append it to the new HTML
    if isinstance(tag, str):
        new_html.body.append(soup.find(tag))
    # If the tag is a dictionary, extract tag name and attributes,
    # then find them in the original HTML and append them to the new HTML
    elif isinstance(tag, dict):
        tag_name = list(tag.keys())[0]
        tag_attrs = tag[tag_name]
        new_html.body.append(soup.find(tag_name, attrs=tag_attrs))

with open("output1.html", "w") as file:
    file.write(str(new_html))

Assuming you have an HTML document like the one below (which would have been helpful to include for reproducibility's sake):

<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <p class="m-b-0">Paragraph with class 'm-b-0'.</p>
    <div id="right-col">
        <p>Paragraph inside the 'right-col' div.</p>
    </div>
    <p>Paragraph outside the targeted tags.</p>
</body>
</html>

the resulting output1.html will contain the following content (shown indented here for readability; str(new_html) itself does not re-indent the markup):

<html>
   <body>
      <title>Test Page</title>
      <p class="m-b-0">Paragraph with class 'm-b-0'.</p>
      <div id="right-col">
         <p>Paragraph inside the 'right-col' div.</p>
      </div>
   </body>
</html>
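
If the list of tags grows, the same idea can be wrapped in a small helper. The sketch below is one possible variant, not the code from the answer above: `build_page` and its `selectors` parameter are hypothetical names, it uses CSS selectors with `select_one` instead of `find`, it skips selectors that match nothing (`find` and `select_one` return `None` in that case), and it appends copies so the source soup is left untouched. The sample HTML is made up for the example.

```python
import copy

from bs4 import BeautifulSoup


def build_page(html, selectors):
    """Return a new document holding the first match for each CSS selector.

    Hypothetical helper for illustration; selectors with no match are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    new_html = BeautifulSoup("<html><body></body></html>", "html.parser")
    for selector in selectors:
        tag = soup.select_one(selector)
        if tag is None:
            continue
        # Append a copy: appending the tag itself would move it out of `soup`.
        new_html.body.append(copy.copy(tag))
    return new_html


sample = """<html><head><title>Test Page</title></head>
<body>
<p class="m-b-0">Kept paragraph.</p>
<div id="right-col"><p>Kept div.</p></div>
<p>Dropped paragraph.</p>
</body></html>"""

page = build_page(sample, ["title", "p.m-b-0", "div#right-col", "span.missing"])
result = str(page)
print(page.prettify())
```

`prettify()` returns an indented string, which is one way to get a file laid out like the listing above.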

furas · Answer

For simple HTML I would use a standard f-string or .format():

val1 = soup.find('title')
val2 = soup.find('p', attrs={'class': 'm-b-0'})
val3 = soup.find('div', attrs={'id' :'right-col'})

# f-string
new_html = f"<html><body>\n{val1}\n{val2}\n{val3}\n</body></html>"

# .format
new_html = "<html><body>\n{}\n{}\n{}\n</body></html>".format(val1, val2, val3)

Python also has the older % formatting:

new_html = "<html><body>\n%s\n%s\n%s\n</body></html>" % (val1, val2, val3)

There is also string.Template, but I have never used it because an f-string can do the same job.

from string import Template

template = Template('<html><body>\n$item1\n$item2\n$item3\n</body></html>')

new_html = template.substitute(item1=val1, item2=val2, item3=val3)
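
One caveat with all of these string approaches: `soup.find()` returns `None` when nothing matches, and the formatting calls above would then write the literal text `None` into the page. A minimal sketch of one way to guard against that, using plain strings as stand-ins for real tags (the `render_page` name is made up for this example):

```python
def render_page(values):
    """Join the given fragments into a page, skipping any that are None."""
    body = "\n".join(str(value) for value in values if value is not None)
    return f"<html><body>\n{body}\n</body></html>"


# Stand-ins for soup.find() results; the None mimics a tag that was not found.
fragments = ["<title>Test Page</title>", None, '<div id="right-col"></div>']
new_html = render_page(fragments)
print(new_html)
```

Because the fragments are joined in a loop, the same function works for any number of tags.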

For something more complex I would use Jinja, which is the template engine Flask uses.

from jinja2 import Environment, BaseLoader

template = '<html><body>\n{{item1}}\n{{item2}}\n{{item3}}\n</body></html>'

rtemplate = Environment(loader=BaseLoader).from_string(template)

new_html = rtemplate.render(item1=val1, item2=val2, item3=val3)

print(new_html)

It lets you use {% for %}, {% if %}, etc. directly in the template, so I can pass all the values as a list or tuple and loop over them in the template itself:

from jinja2 import Environment, BaseLoader

template = '<html><body>\n{% for val in items %}{{val}}\n{% endfor %}</body></html>'

rtemplate = Environment(loader=BaseLoader).from_string(template)

new_html = rtemplate.render(items=(val1,val2,val3))

print(new_html)
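
The loop can also filter values inside the template, which helps when a lookup found nothing. This is a small sketch with plain strings standing in for `soup.find()` results. It assumes autoescaping is left at its default (off) for an `Environment` created this way; with autoescaping enabled the markup would be escaped instead of inserted as HTML.

```python
from jinja2 import Environment, BaseLoader

# The "if" clause on the for-loop skips items that are None.
template = (
    "<html><body>\n"
    "{% for val in items if val is not none %}{{ val }}\n"
    "{% endfor %}</body></html>"
)
rtemplate = Environment(loader=BaseLoader()).from_string(template)

# Stand-ins for soup.find() results; the None mimics a tag that was not found.
items = ["<title>Test Page</title>", None, '<div id="right-col"></div>']
new_html = rtemplate.render(items=items)
print(new_html)
```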

Of course, you can also use BeautifulSoup to build the HTML (see the documentation for append and extend), but I think the other methods are simpler. BeautifulSoup can be useful when you already have some (long) HTML and want to replace or add a few items.
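
For comparison, a short sketch of the BeautifulSoup route mentioned above, using `extend` to add several tags in one call. The sample HTML is invented for the example, and the code assumes a Beautiful Soup version recent enough to provide `Tag.extend`.

```python
from bs4 import BeautifulSoup

source = BeautifulSoup(
    '<title>Test Page</title><p class="m-b-0">Kept</p><p>Dropped</p>',
    "html.parser",
)
new_html = BeautifulSoup("<html><body></body></html>", "html.parser")

# Collect the wanted tags first, dropping any lookup that returned None.
wanted = [
    source.find("title"),
    source.find("p", attrs={"class": "m-b-0"}),
    source.find("div", attrs={"id": "right-col"}),  # not in the sample -> None
]
found = [tag for tag in wanted if tag is not None]

# extend() appends every tag in the list; each one is moved out of `source`.
new_html.body.extend(found)

output = str(new_html)
print(output)
```

Note that appending moves the tags rather than copying them, so `source` no longer contains them afterwards.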

Full code with all examples (except BeautifulSoup):

Instead of the values returned by soup.find(), I use the literal text of each soup.find() call as a placeholder string.

val1 = "soup.find('title')"
val2 = "soup.find('p', attrs={'class': 'm-b-0'})"
val3 = "soup.find('div', attrs={'id' :'right-col'})"

print('\n--- f-string ---\n')
new_html = f"<html><body>\n{val1}\n{val2}\n{val3}\n</body></html>"
print(new_html)

print('\n--- .format() ---\n')
new_html = "<html><body>\n{}\n{}\n{}\n</body></html>".format(val1, val2, val3)
print(new_html)

print('\n--- % ---\n')
new_html = "<html><body>\n%s\n%s\n%s\n</body></html>" % (val1, val2, val3)
print(new_html)

# ------------------

from string import Template

template = Template('<html><body>\n$item1\n$item2\n$item3\n</body></html>')

print('\n--- string.Template ---\n')
new_html = template.substitute(item1=val1, item2=val2, item3=val3)
print(new_html)

# ------------------

from jinja2 import Environment, BaseLoader

print('\n--- jinja ---\n')

template = '<html><body>\n{{item1}}\n{{item2}}\n{{item3}}\n</body></html>'
rtemplate = Environment(loader=BaseLoader).from_string(template)

new_html = rtemplate.render(item1=val1, item2=val2, item3=val3)
print(new_html)

print('\n--- jinja - {% for %} ---\n')

template = '<html><body>\n{% for val in items %}{{val}}\n{% endfor %}</body></html>'
rtemplate = Environment(loader=BaseLoader).from_string(template)

new_html = rtemplate.render(items=(val1,val2,val3))
print(new_html)

Tags:

python

beautifulsoup

Villard

2 Answers

Andreas Violaris

furas
