I'm exploring BeautifulSoup and aiming to retain only specific tags in an HTML file to create a new one.
I can successfully achieve this with the following program. However, I believe there might be a more suitable and natural approach without the need to manually append the strings.
from bs4 import BeautifulSoup
#soup = BeautifulSoup(page.content, 'html.parser')
with open('P:/Test.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')
NewHTML = "<html><body>"
NewHTML += "\n" + str(soup.find('title'))
NewHTML += "\n" + str(soup.find('p', attrs={'class': 'm-b-0'}))
NewHTML += "\n" + str(soup.find('div', attrs={'id': 'right-col'}))
NewHTML += "</body></html>"
with open("output1.html", "w") as file:
    file.write(NewHTML)
You can have a list of desired tags, iterate through them, and use Beautiful Soup's append method to selectively include corresponding elements in the new HTML structure.
from bs4 import BeautifulSoup
with open('Test.html', 'r') as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'html.parser')
new_html = BeautifulSoup("<html><body></body></html>", 'html.parser')
tags_to_keep = ['title', {'p': {'class': 'm-b-0'}}, {'div': {'id': 'right-col'}}]
# Iterate through the tags to keep and append them to the new HTML
for tag in tags_to_keep:
    # If the tag is a string, find it in the original HTML
    # and append it to the new HTML
    if isinstance(tag, str):
        new_html.body.append(soup.find(tag))
    # If the tag is a dictionary, extract tag name and attributes,
    # then find them in the original HTML and append them to the new HTML
    elif isinstance(tag, dict):
        tag_name = list(tag.keys())[0]
        tag_attrs = tag[tag_name]
        new_html.body.append(soup.find(tag_name, attrs=tag_attrs))
with open("output1.html", "w") as file:
    file.write(str(new_html))
Assuming you have an HTML document like the one below (which would have been helpful to include for reproducibility's sake):
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
<p>Paragraph outside the targeted tags.</p>
</body>
</html>
the resulting output1.html will contain the following content (indented here for readability; str() does not pretty-print the markup):
<html>
<body>
<title>Test Page</title>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col">
<p>Paragraph inside the 'right-col' div.</p>
</div>
</body>
</html>
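One caveat with the loop above: soup.find() returns None when nothing matches, and appending None to a tag raises an error. As a rough sketch of a variant that guards against this, the version below describes each element with a CSS selector and uses select_one(). The helper name keep_only, the sample markup, and the "span.missing" selector are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def keep_only(html, selectors):
    """Return new HTML holding the first match of each CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    new_soup = BeautifulSoup("<html><body></body></html>", "html.parser")
    for selector in selectors:
        element = soup.select_one(selector)
        # select_one() returns None when nothing matches; skip those
        if element is not None:
            # append() moves the element out of the original tree
            new_soup.body.append(element)
    return str(new_soup)


# Hypothetical sample input standing in for Test.html
sample = """<html><head><title>Test Page</title></head><body>
<p class="m-b-0">Paragraph with class 'm-b-0'.</p>
<div id="right-col"><p>Paragraph inside the 'right-col' div.</p></div>
<p>Paragraph outside the targeted tags.</p>
</body></html>"""

# "span.missing" matches nothing and is silently skipped
result = keep_only(sample, ["title", "p.m-b-0", "div#right-col", "span.missing"])
print(result)
```

The selector strings replace the mixed list of strings and dictionaries, so the loop needs no isinstance() checks.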
For simple HTML I would use a standard f-string or the standard .format() method:
val1 = soup.find('title')
val2 = soup.find('p', attrs={'class': 'm-b-0'})
val3 = soup.find('div', attrs={'id' :'right-col'})
# f-string
new_html = f"<html><body>\n{val1}\n{val2}\n{val3}\n</body></html>"
# .format
new_html = "<html><body>\n{}\n{}\n{}\n</body></html>".format(val1, val2, val3)
Python also has the older %-style formatting:
new_html = "<html><body>\n%s\n%s\n%s\n</body></html>" % (val1, val2, val3)
There is also string.Template, but I have never used it because an f-string can do the same.
from string import Template
template = Template('<html><body>\n$item1\n$item2\n$item3\n</body></html>')
new_html = template.substitute(item1=val1, item2=val2, item3=val3)
For something more complex I would use Jinja, the template engine used by Flask.
from jinja2 import Environment, BaseLoader
template = '<html><body>\n{{item1}}\n{{item2}}\n{{item3}}\n</body></html>'
rtemplate = Environment(loader=BaseLoader).from_string(template)
new_html = rtemplate.render(item1=val1, item2=val2, item3=val3)
print(new_html)
It allows you to use {% for %}, {% if %}, etc. directly in the template, so I can send all values as a list or tuple and use a for-loop directly in the template:
from jinja2 import Environment, BaseLoader
template = '<html><body>\n{% for val in items %}{{val}}\n{% endfor %}</body></html>'
rtemplate = Environment(loader=BaseLoader).from_string(template)
new_html = rtemplate.render(items=(val1,val2,val3))
print(new_html)
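The {% if %} tag mentioned above can be combined with the loop, for example to skip values that are None (which is what soup.find() returns when it finds nothing). This is only a sketch: the item strings are made-up stand-ins for the soup.find() results, and it relies on Jinja's built-in "none" test.

```python
from jinja2 import Environment, BaseLoader

# Hypothetical stand-ins for soup.find() results; None mimics "not found"
items = ('<title>Test Page</title>', None, '<div id="right-col">text</div>')

# The if-block emits a value only when it is not None
template = (
    '<html><body>\n'
    '{% for val in items %}{% if val is not none %}{{ val }}\n{% endif %}{% endfor %}'
    '</body></html>'
)

rtemplate = Environment(loader=BaseLoader()).from_string(template)
new_html = rtemplate.render(items=items)
print(new_html)
```

Without the check, a missing element would be rendered as the literal text None.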
Of course you can also try to use BeautifulSoup to create the HTML (see the documentation for append and extend), but I think the other methods are simpler. BeautifulSoup can be useful if you already have some (long) HTML and you want to replace or add some items.
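For completeness, here is a minimal sketch of what the BeautifulSoup route might look like, building the skeleton with new_tag() and filling it with extend(). It assumes a Beautiful Soup version that provides Tag.extend() (4.7.0 or later, as far as I know), and the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input standing in for the real file
source_html = """<html><head><title>Test Page</title></head><body>
<p class="m-b-0">Keep me.</p>
<div id="right-col"><p>Keep me too.</p></div>
<p>Drop me.</p>
</body></html>"""

soup = BeautifulSoup(source_html, "html.parser")

# Build an empty <html><body></body></html> skeleton from scratch
new_soup = BeautifulSoup("", "html.parser")
html_tag = new_soup.new_tag("html")
body_tag = new_soup.new_tag("body")
html_tag.append(body_tag)
new_soup.append(html_tag)

found = [
    soup.find("title"),
    soup.find("p", attrs={"class": "m-b-0"}),
    soup.find("div", attrs={"id": "right-col"}),
]
# extend() appends each element in order; drop any that were not found
body_tag.extend([tag for tag in found if tag is not None])

new_html = str(new_soup)
print(new_html)
```

Note that appending an element moves it out of the original soup, so the source tree is modified as a side effect.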
Full code with all examples (except BeautifulSoup):
Instead of the values returned by soup.find(), I use the literal text of each soup.find() call as a placeholder string, so the code runs without an HTML file.
val1 = "soup.find('title')"
val2 = "soup.find('p', attrs={'class': 'm-b-0'})"
val3 = "soup.find('div', attrs={'id' :'right-col'})"
print('\n--- f-string ---\n')
new_html = f"<html><body>\n{val1}\n{val2}\n{val3}\n</body></html>"
print(new_html)
print('\n--- .format() ---\n')
new_html = "<html><body>\n{}\n{}\n{}\n</body></html>".format(val1, val2, val3)
print(new_html)
print('\n--- % ---\n')
new_html = "<html><body>\n%s\n%s\n%s\n</body></html>" % (val1, val2, val3)
print(new_html)
# ------------------
from string import Template
template = Template('<html><body>\n$item1\n$item2\n$item3\n</body></html>')
print('\n--- string.Template ---\n')
new_html = template.substitute(item1=val1, item2=val2, item3=val3)
print(new_html)
# ------------------
from jinja2 import Environment, BaseLoader
print('\n--- jinja ---\n')
template = '<html><body>\n{{item1}}\n{{item2}}\n{{item3}}\n</body></html>'
rtemplate = Environment(loader=BaseLoader).from_string(template)
new_html = rtemplate.render(item1=val1, item2=val2, item3=val3)
print(new_html)
print('\n--- jinja - {% for %} ---\n')
template = '<html><body>\n{% for val in items %}{{val}}\n{% endfor %}</body></html>'
rtemplate = Environment(loader=BaseLoader).from_string(template)
new_html = rtemplate.render(items=(val1,val2,val3))
print(new_html)