I am trying to insert an HTML string into a BeautifulSoup object. If I insert it directly, bs4 sanitizes (escapes) the HTML. If I take the HTML string, create a soup from it, and insert that, I have problems using the find function.
This thread on SO suggests that inserting BeautifulSoup objects can cause problems. I am using the solution from that post and recreating the soup each time I do an insert.
But surely there's a better way to insert an HTML string into a soup.
EDIT: I'll add some code as an example of what the problem is:
from bs4 import BeautifulSoup
mainSoup = BeautifulSoup("""
<html>
<div class='first'></div>
<div class='second'></div>
</html>
""")
extraSoup = BeautifulSoup('<span class="first-content"></span>')
tag = mainSoup.find(class_='first')
tag.insert(1, extraSoup)
print(mainSoup.find(class_='second'))
# prints None
A new tag can be created by calling the BeautifulSoup object's new_tag() method. To insert it, use the append() method: the new tag is added after the parent tag's existing children.
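As a rough illustration of the new_tag() plus append() approach described above, here is a minimal sketch. The markup, the variable names, and the "hello" text are made up for the example; only new_tag(), append(), and find() are the bs4 calls being shown.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for this sketch, modelled on the question).
soup = BeautifulSoup(
    "<div class='first'></div><div class='second'></div>",
    "html.parser",
)

# new_tag() builds a Tag that belongs to this soup but is not in the tree yet.
span = soup.new_tag("span")
span["class"] = "first-content"
span.string = "hello"

# append() places the new tag after any existing children of the parent.
first_div = soup.find(class_="first")
first_div.append(span)

result = str(soup)
print(result)
```

Because the span is a plain Tag rather than a whole BeautifulSoup object, later find() calls on the main soup should keep working.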
The HTML content of web pages can be parsed and scraped with Beautiful Soup.
The simplest way, if you already have an HTML string, is to insert another BeautifulSoup object.
from bs4 import BeautifulSoup
doc = '''
<div>
test1
</div>
'''
soup = BeautifulSoup(doc, 'html.parser')
soup.div.append(BeautifulSoup('<div>insert1</div>', 'html.parser'))
print(soup.prettify())
Output:
<div>
test1
<div>
insert1
</div>
</div>
How about this? The idea is to let BeautifulSoup parse the string, then insert the resulting tag (the span) rather than the whole BeautifulSoup object. This seems to avoid the "None" problem.
from bs4 import BeautifulSoup
mainSoup = BeautifulSoup("""
<html>
<div class='first'></div>
<div class='second'></div>
</html>
""", 'html.parser')
extraSoup = BeautifulSoup('<span class="first-content"></span>', 'html.parser')
tag = mainSoup.find(class_='first')
tag.insert(1, extraSoup.span)
print(mainSoup.find(class_='second'))
Output:
<div class="second"></div>
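If the HTML string has more than one top-level element, the same idea can be stretched a bit: parse the string into a throwaway soup and move each top-level node across. This is only a sketch. append_html is a hypothetical helper name, not part of bs4, and the sample markup is assumed.

```python
from bs4 import BeautifulSoup


def append_html(parent, html, parser="html.parser"):
    """Hypothetical helper: parse `html` and move its top-level nodes into `parent`."""
    fragment = BeautifulSoup(html, parser)
    # Copy the list first: appending a node removes it from `fragment`.
    for node in list(fragment.contents):
        parent.append(node)
    return parent


main_soup = BeautifulSoup(
    "<html><div class='first'></div><div class='second'></div></html>",
    "html.parser",
)
first = main_soup.find(class_="first")
append_html(first, '<span class="first-content"></span><b>more</b>')

# The second div should still be reachable after the insert.
second = main_soup.find(class_="second")
print(main_soup)
```

Only tags and strings are moved into the main tree; the temporary BeautifulSoup object itself is never inserted.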
The best way to do this is to create a new span tag and insert it into your mainSoup. That is what the .new_tag method is for.
In [34]: from bs4 import BeautifulSoup
In [35]: mainSoup = BeautifulSoup("""
....: <html>
....: <div class='first'></div>
....: <div class='second'></div>
....: </html>
....: """)
In [36]: tag = mainSoup.new_tag('span')
In [37]: tag.attrs['class'] = 'first-content'
In [38]: mainSoup.insert(1, tag)
In [39]: print(mainSoup.find(class_='second'))
<div class="second"></div>
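Note that the session above inserts the span into mainSoup itself. If the goal is to put it inside the first div at a particular position, something along these lines might do it. The extra p element is an assumption, added only so the position is visible.

```python
from bs4 import BeautifulSoup

# Sample markup (assumed); the <p> exists only to show ordering.
main_soup = BeautifulSoup(
    "<html><div class='first'><p>existing</p></div>"
    "<div class='second'></div></html>",
    "html.parser",
)

tag = main_soup.new_tag("span")
tag["class"] = "first-content"

# insert(0, ...) puts the span before the div's existing children.
first = main_soup.find(class_="first")
first.insert(0, tag)

child_names = [child.name for child in first.contents]
second = main_soup.find(class_="second")
print(child_names)
print(second)
```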