
find() after replaceWith() doesn't work (using BeautifulSoup)

Please consider the following Python session:

>>> from BeautifulSoup import BeautifulSoup
>>> s = BeautifulSoup("<p>This <i>is</i> a <i>test</i>.</p>"); myi = s.find("i")
>>> myi.replaceWith(BeautifulSoup("was"))
>>> s.find("i")
>>> s = BeautifulSoup("<p>This <i>is</i> a <i>test</i>.</p>"); myi = s.find("i")
>>> myi.replaceWith("was")
>>> s.find("i")
<i>test</i>

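Note that the first s.find("i") call (the fourth line of the session) prints nothing, i.e. it returns None, even though the document still contains <i>test</i>.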

What's the reason for this? Is there a workaround?

EDIT: Actually, the example doesn't demonstrate my real use case, which is:

myi.replaceWith(BeautifulSoup("wa<b>s</b>"))

Whenever the inserted part itself contains nontrivial HTML, I don't see how to replace this syntax with something else. Simply writing

myi.replaceWith("wa<b>s</b>")

will escape the HTML special characters as entities instead of inserting a <b> element.

asked Mar 16 '13 at 21:03 by thomas


3 Answers

Simpler answer: after your call to replaceWith, regenerate and clean s by re-parsing it with s = BeautifulSoup(s.renderContents()). After that, find works again.

answered Oct 07 '22 at 06:10 by Steve K


The problem seems to be that a BeautifulSoup object is considered an entire document. find iterates through the document asking each element for the next element after it. But when it gets to your BeautifulSoup("was"), that object thinks it is the whole document, so it says there is nothing after it. This aborts the search too early.

I don't think BeautifulSoup is designed to have BeautifulSoup objects inside other BeautifulSoup objects. The workaround is simply not to do that. Why do you feel you need the first form instead of the second one, which already works? If you want to replace an element with some bit of HTML, use a Tag for your replacement, not a BeautifulSoup object.
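To make the explanation above concrete, here is a toy model of a "follow the next pointer" search. It is only a sketch of the hypothesis described in this answer, not BeautifulSoup's actual code: the Node class and the chain, find and build helpers are made up for illustration, and tags and text are simply represented by labels.

```python
class Node:
    """Hypothetical stand-in for a parse-tree element (not a BeautifulSoup class)."""

    def __init__(self, label):
        self.label = label
        self.next = None  # next element in document order


def chain(labels):
    """Create one node per label and link them in document order."""
    nodes = [Node(label) for label in labels]
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    return nodes


def find(start, label):
    """Walk the .next pointers until a node with the given label turns up."""
    node = start
    while node is not None:
        if node.label == label:
            return node
        node = node.next
    return None  # ran off the end of the chain


def build(replacement_is_document):
    """Model the tree after the first <i> has been replaced by "was"."""
    nodes = chain(["p", "This ", "was", " a ", "i", "test", "."])
    if replacement_is_document:
        # Assumption: a nested document believes nothing comes after it.
        nodes[2].next = None
    return nodes[0]


print(find(build(True), "i"))          # None: the walk stops at the nested document
print(find(build(False), "i").label)   # i: the remaining <i> is found
```

In this model the second <i> is still present in both cases. It is only unreachable when the replacement cuts the chain, which matches the empty result of s.find("i") in the question.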

answered Oct 07 '22 at 05:10 by BrenBarn


I think I found a workaround that solves the issue for me. Here is the whole code again as a Python script, to give a complete example:

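from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

s = BeautifulSoup("<p>This <i>is</i> a <i>test</i>.</p>")
myi = s.find("i")
s2 = BeautifulSoup("wa<b>s</b>")  # replacement fragment, parsed separately
myi_id = myi.parent.contents.index(myi)  # position of the first <i> in its parent
# Insert each parsed node right after myi; reversed() keeps the original order,
# because every node is inserted at the same position.
for c in reversed(s2.contents):
    myi.parent.insert(myi_id + 1, c)
myi.extract()  # finally remove the original <i> element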

Note that this won't work without reversed(). Skipping it does more than just change the order of the elements: part of the replacement is lost. If you really want the order to be reversed, you have to write the following:

for c in list(s2.contents):
    myi.parent.insert(myi_id + 1, c)

Can somebody please explain why skipping list() omits <b>s</b>? (Please answer in a comment, because this is not the main question here.)
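A likely explanation, sketched here with plain Python lists rather than BeautifulSoup itself: inserting a node into a new parent removes it from its old parent's contents, so s2.contents shrinks while the loop is still iterating over it. The move_children helper below is hypothetical and only mimics that "remove from the old list, add to the new one" behaviour.

```python
def move_children(source, order):
    """Move every item yielded by order(source) out of source into a new list.

    Removing the item from source imitates a node leaving its old parent
    when it is inserted somewhere else.
    """
    moved = []
    for child in order(source):
        source.remove(child)
        moved.append(child)
    return moved


# Iterating over the live list: after "wa" is removed, the list has one item
# left and the iterator is already past index 0, so "<b>s</b>" is skipped.
print(move_children(["wa", "<b>s</b>"], lambda contents: contents))  # ['wa']

# Iterating over a copy: both items are visited.
print(move_children(["wa", "<b>s</b>"], list))  # ['wa', '<b>s</b>']

# Iterating backwards: removing the last item does not shift the earlier ones.
print(move_children(["wa", "<b>s</b>"], reversed))  # ['<b>s</b>', 'wa']
```

This would also explain why reversed() happens to be safe in the script above, while a plain loop over s2.contents is not.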

answered Oct 07 '22 at 06:10 by thomas