I have a document in a BeautifulSoup soup that I am partway through converting from HTML to XML. After some replacement and editing in the soup, the body is essentially:
<Text...></Text> # This replaces <a href..> tags but automatically creates the </Text>
<p class=norm ...</p>
<p class=norm ...</p>
<Text...></Text>
<p class=norm ...</p> and so forth.
I need to "move" the <p> tags so that they become children of <Text>, or find a way to suppress the automatic </Text>. I want:
<Text...>
<p class=norm ...</p>
<p class=norm ...</p>
</Text>
<Text...>
<p class=norm ...</p>
</Text>
I've tried using item.insert and item.append but I'm thinking there must be a more elegant solution.
for item in soup.findAll(['p','span']):
    if item.name == 'span' and item.has_key('class') and item['class'] == 'section':
        xBCV = short_2_long(item._getAttrMap().get('value',''))
        if currentnode:
            pass
        currentnode = Tag(soup,'Text', attrs=[('TypeOf', 'Section'),... ])
        item.replaceWith(currentnode) # works but creates end tag
    elif item.name == 'p' and item.has_key('class') and item['class'] == 'norm':
        childcdatanode = None
        for ahref in item.findAll('a'):
            if childcdatanode:
                pass
            newlink = filter_hrefs(str(ahref))
            childcdatanode = Tag(soup, newlink)
            ahref.replaceWith(childcdatanode)
Thanks
You can use insert to move tags. The docs say: "An element can occur in only one place in one parse tree. If you give insert an element that's already connected to a soup object, it gets disconnected (with extract) before it gets connected elsewhere."
If your HTML looks like this:
<text></text>
<p class="norm">1</p>
<p class="norm">2</p>
<text></text>
<p class="norm">3</p>
... this:
for item in soup.findAll(['text', 'p']):
    if item.name == 'text':
        text = item
    if item.name == 'p':
        text.insert(len(text.contents), item)
... would produce the following:
<text><p class="norm">1</p><p class="norm">2</p></text>
<text><p class="norm">3</p></text>
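If you want to try the same "move each <p> under the closest preceding <Text>" idea without installing anything, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup. That is a swapped-in library, not what the code above uses. The sketch assumes the fragment is already well-formed XML with a single root element; the <body> wrapper, the sample markup, and the function name group_under_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: well-formed XML with a single <body> root.
SAMPLE = (
    "<body>"
    "<Text/>"
    '<p class="norm">1</p>'
    '<p class="norm">2</p>'
    "<Text/>"
    '<p class="norm">3</p>'
    "</body>"
)


def group_under_text(root):
    """Move each <p> under the closest preceding <Text> sibling, in place."""
    current = None
    # Iterate over a copy of the child list, because the loop removes children.
    for child in list(root):
        if child.tag == "Text":
            current = child
        elif child.tag == "p" and current is not None:
            root.remove(child)     # detach from the old parent first
            current.append(child)  # then attach as the last child of <Text>
    return root


root = group_under_text(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note one difference from BeautifulSoup's insert: ElementTree does not track parents, so append alone would leave the <p> in both places. That is why the sketch calls remove explicitly before append. A <p> that appears before the first <Text> is left where it is.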