Insert html string into BeautifulSoup object

I am trying to insert an HTML string into a BeautifulSoup object. If I insert it directly, bs4 sanitizes (escapes) the HTML. If I take the HTML string, create a soup from it, and insert that, I have problems using the find function. This thread on SO suggests that inserting BeautifulSoup objects can cause problems. I am using the solution from that post and recreating the soup each time I do an insert.

But surely there's a better way to insert an html string into a soup.
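
For reference, the "sanitizing" described above can be reproduced with a minimal sketch like the one below. It assumes the built-in html.parser and a made-up one-div document; the variable names are illustrative and not from the original post. Appending a plain str makes bs4 treat it as text, so the angle brackets are escaped on output.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document, used only for illustration.
soup = BeautifulSoup('<div class="first"></div>', 'html.parser')

# A plain str is added as a text node; it is not parsed as markup.
soup.div.append('<span class="first-content"></span>')

escaped = str(soup.div)
print(escaped)  # the span shows up as escaped text (&lt;span ...&gt;)

# No real <span> tag was created inside the div.
print(soup.find('span'))
```

This is why the string has to be parsed into nodes before it is inserted.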

EDIT: I'll add some code as an example of what the problem is

from bs4 import BeautifulSoup

mainSoup = BeautifulSoup("""
<html>
    <div class='first'></div>
    <div class='second'></div>
</html>
""")

extraSoup = BeautifulSoup('<span class="first-content"></span>')

tag = mainSoup.find(class_='first')
tag.insert(1, extraSoup)

print(mainSoup.find(class_='second'))
# prints None
asked Jul 05 '15 11:07 by Preom



2 Answers

The simplest way, if you already have an HTML string, is to parse it into its own BeautifulSoup object and insert that.

from bs4 import BeautifulSoup

doc = '''
<div>
 test1
</div>
'''

soup = BeautifulSoup(doc, 'html.parser')

soup.div.append(BeautifulSoup('<div>insert1</div>', 'html.parser'))

print(soup.prettify())

Output:

<div>
 test1
<div>
 insert1
</div>
</div>

Update 1

How about this? The idea is to let BeautifulSoup parse the string and then insert the resulting Tag node (the span) rather than the whole BeautifulSoup object. This appears to avoid the "None" problem.

from bs4 import BeautifulSoup

mainSoup = BeautifulSoup("""
<html>
    <div class='first'></div>
    <div class='second'></div>
</html>
""", 'html.parser')

extraSoup = BeautifulSoup('<span class="first-content"></span>', 'html.parser')
tag = mainSoup.find(class_='first')
tag.insert(1, extraSoup.span)

print(mainSoup.find(class_='second'))

Output:

<div class="second"></div>
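
The approach above picks out a single element (extraSoup.span). If the HTML string may hold several top-level nodes, one possible generalization is sketched below. The helper name append_html is hypothetical and not part of bs4; the sketch assumes html.parser and moves each parsed node into the target tag one at a time.

```python
from bs4 import BeautifulSoup


def append_html(parent, html):
    """Parse `html` and move each top-level node into `parent` (hypothetical helper)."""
    fragment = BeautifulSoup(html, 'html.parser')
    # Iterate over a copy: append() detaches each node from `fragment`.
    for node in list(fragment.contents):
        parent.append(node)


main_soup = BeautifulSoup(
    "<html><div class='first'></div><div class='second'></div></html>",
    'html.parser',
)

first = main_soup.find(class_='first')
append_html(first, '<span class="first-content">a</span><em>b</em>')

print(first)
print(main_soup.find(class_='second'))  # still found after the insert
```

Because only Tag and string nodes are moved, no BeautifulSoup object ends up nested inside another one.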
answered Nov 27 '22 16:11 by Matthew King


The best way to do this is to create a new span tag and insert it into your mainSoup. That is what the .new_tag method is for.

In [34]: from bs4 import BeautifulSoup

In [35]: mainSoup = BeautifulSoup("""
   ....: <html>
   ....:     <div class='first'></div>
   ....:     <div class='second'></div>
   ....: </html>
   ....: """)

In [36]: tag = mainSoup.new_tag('span')

In [37]: tag.attrs['class'] = 'first-content'

In [38]: mainSoup.insert(1, tag)

In [39]: print(mainSoup.find(class_='second'))
<div class="second"></div>
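
The session above inserts the new tag at the top level of mainSoup. If the goal is to place it inside the first div, as in the question, a sketch along these lines may be closer. It assumes html.parser and uses illustrative variable names.

```python
from bs4 import BeautifulSoup

main_soup = BeautifulSoup(
    "<html><div class='first'></div><div class='second'></div></html>",
    'html.parser',
)

# Build the tag with the soup's factory method, then set its attribute.
span = main_soup.new_tag('span')
span['class'] = 'first-content'

# Append to the target div rather than to the top-level soup.
first = main_soup.find(class_='first')
first.append(span)

print(first)
print(main_soup.find(class_='second'))  # the sibling div is still reachable
```

This works well when the markup is built in code; when it arrives as a ready-made string, parsing it and inserting the resulting tag avoids rebuilding it attribute by attribute.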
answered Nov 27 '22 17:11 by styvane