Clone element with BeautifulSoup

I have to copy a part of one document to another, but I don't want to modify the document I copy from.

If I use .extract(), it removes the element from the tree. If I just append the selected element, like document2.append(document1.tag), it is still removed from document1.

Since I work with real files, I could simply not save document1 after modifying it, but is there a way to do this without corrupting the document?
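
A minimal reproduction of the behaviour described above, as a sketch only: it assumes bs4 with the built-in html.parser, and the markup and the names paragraph, moved_out and moved_in are made up for illustration.

```python
from bs4 import BeautifulSoup

# Two hypothetical documents; html.parser ships with Python.
document1 = BeautifulSoup('<div id="src"><p>hello</p></div>', "html.parser")
document2 = BeautifulSoup('<div id="dst"></div>', "html.parser")

paragraph = document1.p
# append() re-parents the tag: it is detached from document1 first.
document2.div.append(paragraph)

moved_out = document1.p is None      # the <p> no longer exists in document1
moved_in = document2.p is paragraph  # the very same object now lives in document2
print(moved_out, moved_in)
```

A tag can only have one parent at a time, which is why appending it elsewhere behaves like a move rather than a copy.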

Anton Vernigor asked Apr 14 '14 10:04


3 Answers

There is no native clone function in BeautifulSoup in versions before 4.4 (released July 2015); you'd have to create a deep copy yourself, which is tricky as each element maintains links to the rest of the tree.

To clone an element and all of its children, you have to copy its attributes and rebuild the parent-child relationships, and this has to happen recursively. This is best done by not copying the relationship attributes at all and instead re-seating each recursively cloned element in its new parent:

from bs4 import Tag, NavigableString

def clone(el):
    if isinstance(el, NavigableString):
        return type(el)(el)

    copy = Tag(None, el.builder, el.name, el.namespace, el.nsprefix)
    # work around bug where there is no builder set
    # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
    copy.attrs = dict(el.attrs)
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(copy, attr, getattr(el, attr))
    for child in el.contents:
        copy.append(clone(child))
    return copy

This method is somewhat sensitive to the BeautifulSoup version; I tested it with 4.3, and future versions may add attributes that need to be copied too.

You could also monkeypatch this functionality into BeautifulSoup:

from bs4 import Tag, NavigableString


def tag_clone(self):
    copy = type(self)(None, self.builder, self.name, self.namespace, 
                      self.nsprefix)
    # work around bug where there is no builder set
    # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
    copy.attrs = dict(self.attrs)
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(copy, attr, getattr(self, attr))
    for child in self.contents:
        copy.append(child.clone())
    return copy


Tag.clone = tag_clone
NavigableString.clone = lambda self: type(self)(self)

letting you call .clone() on elements directly:

document2.body.append(document1.find('div', id='someid').clone())

My feature request to the BeautifulSoup project was accepted and tweaked to use the copy.copy() function; now that BeautifulSoup 4.4 is released you can use that version (or newer) and do:

import copy

document2.body.append(copy.copy(document1.find('div', id='someid')))
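
To check that the copy really is independent, something like the following sketch can be used. It assumes BeautifulSoup 4.4 or newer and the built-in html.parser; the markup and the names original, duplicate, still_there and independent are hypothetical.

```python
import copy

from bs4 import BeautifulSoup

document1 = BeautifulSoup('<div id="someid"><b>pizza</b></div>', "html.parser")
document2 = BeautifulSoup("<html><body></body></html>", "html.parser")

original = document1.find("div", id="someid")
duplicate = copy.copy(original)  # detached copy of the tag and its contents
document2.body.append(duplicate)

# Changing the copy must not affect the source document.
duplicate.b.string = "pasta"

still_there = document1.find("div", id="someid") is original
independent = duplicate is not original and duplicate.parent is document2.body
print(original.b.string, duplicate.b.string)
```

Note that the keyword is id, not id_; only class needs the trailing underscore (class_) because it is a reserved word in Python.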
Martijn Pieters answered Oct 17 '22 09:10


It may not be the fastest solution, but it is short and seems to work...

clonedtag = BeautifulSoup(str(sourcetag), 'lxml').body.contents[0]

With the lxml or html5lib parser, BeautifulSoup creates an extra <html><body>...</body></html> around the cloned tag (in order to make the "soup" a sane html document); .body.contents[0] strips those wrapping tags. The built-in html.parser adds no such wrapper, so with it .body is None.
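
If only the built-in html.parser is available, the same re-parsing trick might look like the sketch below. It assumes html.parser, which does not wrap the fragment in <html><body>, so the clone is simply the first top-level node; the sample markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="someid"><p>text</p></div>', "html.parser")
sourcetag = soup.div

# Serialise the tag and parse it again; no wrapper tags are added by
# html.parser, so the first top-level node is the clone.
clonedtag = BeautifulSoup(str(sourcetag), "html.parser").contents[0]

same_markup = str(clonedtag) == str(sourcetag)
separate_objects = clonedtag is not sourcetag
print(same_markup, separate_objects)
```

The original tag stays in its own tree, at the cost of a full serialise-and-parse round trip.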

This idea was derived from Peter Woods' comment above and Clemens Klein-Robbenhaar's comment below.

andrew pate answered Oct 17 '22 08:10


For Python:

You can copy the parent element like:

import copy
p_copy = copy.copy(soup.p)
print(p_copy)
# <p>I want <b>pizza</b> and more <b>pizza</b>!</p>

Ref: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Section: Copying Beautiful Soup objects

Regards.

da7oom answered Oct 17 '22 09:10