I have an HTML document I need to process, and I'm using BeautifulSoup for that. I would like to retrieve a few "subsoups" from the document and join them into one soup, so that I can later pass it to a function that expects a soup object.
Here is an example:
from bs4 import BeautifulSoup
my_document = """
<html>
<body>
<h1>Some Heading</h1>
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
</body>
</html>
"""
soup = BeautifulSoup(my_document, "html.parser")
# find the needed parts
first = soup.find("div", {"id": "first"})
third = soup.find("div", {"id": "third"})
loner = soup.find("p", {"id": "loner"})
subsoups = [first, third, loner]
# create a new (sub)soup
resulting_soup = do_some_magic(subsoups)
# use it in a function that expects a soup object and calls its methods
function_expecting_a_soup(resulting_soup)
The goal is to have an object in resulting_soup that is, or behaves like, a soup with the following content:
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
Is there a convenient way to do that? If there is a better way to retrieve the "subsoups" than find(), I can use it instead. Thanks.
Update
Wondercricket suggested a solution that concatenates the strings of the found tags and parses them again into a new BeautifulSoup object. That does solve the problem, but the re-parsing may take longer than I'd like, especially when I want to retrieve most of the tags and there are many such documents to process. find() returns a bs4.element.Tag. Isn't there a way to concatenate several Tags into one soup without converting the Tags to strings and parsing those strings?
SoupStrainer would do exactly what you are asking about. As a bonus, you'll get a performance boost, since it parses only what you want it to parse, not the complete document tree:
from bs4 import BeautifulSoup, SoupStrainer
parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
Now the soup object would contain only the desired elements:
<div id="first">
<p>
A paragraph.
</p>
<a href="another_doc.html">
A link
</a>
<p>
A paragraph.
</p>
</div>
<div id="third">
<p>
A paragraph.
</p>
<a href="another_doc.html">
A link
</a>
<a href="yet_another_doc.html">
A link
</a>
</div>
<p id="loner">
A paragraph.
</p>
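To check this against the question's requirement (passing the result to a function that expects a soup), here is a minimal sketch. The shortened document and the names html, only_wanted, strained and ids_of_top_level_tags are made up for illustration; the helper stands in for the question's function_expecting_a_soup.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened stand-in for the document from the question.
html = """<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>"""

# Only elements whose id is in this list are kept while parsing.
only_wanted = SoupStrainer(id=["first", "third", "loner"])
strained = BeautifulSoup(html, "html.parser", parse_only=only_wanted)


def ids_of_top_level_tags(any_soup):
    """Hypothetical consumer: it only relies on regular soup methods."""
    return [tag.get("id") for tag in any_soup.find_all(True, recursive=False)]


top_level_ids = ids_of_top_level_tags(strained)
print(top_level_ids)
print(strained.find("h1"))
```

The result is a regular BeautifulSoup object, so find() and find_all() behave as usual; elements that were filtered out simply are not in the tree.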
Is it also possible to specify not only ids but also tags? For example, what if I want to keep all paragraphs with class="someclass" but not divs with the same class?
In this case, you can write a search function that combines multiple criteria and pass it to the SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
my_document = """
<html>
<body>
<h1>Some Heading</h1>
<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
<p class="myclass">test</p>
</body>
</html>
"""
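def search(tag, attrs):
    # keep <p> elements whose class contains "myclass"
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag
    # keep any element whose id is one of the wanted ids
    if attrs.get("id") in ["first", "third", "loner"]:
        return tag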
parse_only = SoupStrainer(search)
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)
print(soup.prettify())
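If filtering during parsing is not essential, the same criteria can also be applied after parsing by passing a function to find_all(), which calls it with each Tag object. This is only a sketch of that alternative, not the approach of the answer above; the shortened document (including the extra div with class "myclass") and the names html, WANTED_IDS, wanted and matches are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in document; the <div class="myclass"> is added here
# only to show that divs with that class are skipped.
html = """
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<div class="myclass">a div that should be skipped</div>
<p class="myclass">test</p>
<p id="loner">A paragraph.</p>
"""

WANTED_IDS = {"first", "third", "loner"}


def wanted(tag):
    """Return True for <p class="myclass"> or for any tag with a wanted id."""
    if tag.name == "p" and "myclass" in tag.get("class", []):
        return True
    return tag.get("id") in WANTED_IDS


soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(wanted)
for tag in matches:
    print(tag.name, tag.attrs)
```

Unlike the SoupStrainer version, this builds the whole tree first, so it gives up the parsing-time savings in exchange for working on full Tag objects.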
You can use findAll, passing in the ids of the elements you want to keep.
import bs4
soup = bs4.BeautifulSoup(my_document, 'html.parser')
# EDIT: no regex is needed, you can pass in a list of ids
sub = soup.findAll(attrs={'id': ['first', 'third', 'loner']})
# EDIT: using html.parser stops BeautifulSoup from automatically adding html and body tags
sub = bs4.BeautifulSoup('\n\n'.join(str(s) for s in sub), 'html.parser')
print(sub)
>>> <div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>
<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>
<p id="loner">A paragraph.</p>
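If the re-parsing step is a concern, as the question's update says, one possible alternative is to append copies of the found tags to an empty soup instead of joining strings. This is a minimal sketch, assuming a bs4 release in which copy.copy() works on Tag objects; the shortened document and the names html, found and combined are made up for illustration.

```python
import copy

from bs4 import BeautifulSoup

# Shortened stand-in for the document from the question.
html = """<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<div id="third"><a href="another_doc.html">A link</a></div>
<p id="loner">A paragraph.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
found = soup.find_all(attrs={"id": ["first", "third", "loner"]})

# Start from an empty soup and attach a copy of each found tag.
# Copying leaves the original tree untouched; appending the tag itself
# would move it out of the original soup.
combined = BeautifulSoup("", "html.parser")
for tag in found:
    combined.append(copy.copy(tag))

print(combined)
```

Whether this is actually faster than re-parsing depends on the documents, so it is worth timing both variants on real data.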