
Test if a child tag exists in BeautifulSoup

I have XML files with a defined structure but a varying number of tags, like:

file1.xml:

<document>
  <subDoc>
    <id>1</id>
    <myId>1</myId>
  </subDoc>
</document>

file2.xml:

<document>
  <subDoc>
    <id>2</id>
  </subDoc>
</document>

Now I would like to check whether the tag myId exists. So I did the following:

from bs4 import BeautifulSoup

data = open("file1.xml",'r').read()
xml = BeautifulSoup(data)

hasAttrBs = xml.document.subdoc.has_attr('myID')
hasAttrPy = hasattr(xml.document.subdoc,'myID')
hasType = type(xml.document.subdoc.myid)

The result for file1.xml is:

hasAttrBs -> False
hasAttrPy -> True
hasType ->   <class 'bs4.element.Tag'>

file2.xml:

hasAttrBs -> False
hasAttrPy -> True
hasType -> <type 'NoneType'>

Okay, <myId> is not an attribute of <subdoc>.

But how can I test whether a sub-tag exists?

Edit: By the way, I would rather not iterate through the whole subdoc, because that would be very slow. I hope to find a way to address that element directly.

The Bndr asked Oct 20 '15 13:10


4 Answers

if tag.find('child_tag_name'):
    pass  # the child tag exists
ahuigo answered Sep 28 '22 09:09


The simplest way to check whether a child tag exists is:

childTag = xml.find('childTag')
if childTag:
    pass  # do stuff

More specifically to OP's question:

If you don't know the structure of the XML doc, you can use the .find() method of the soup. Something like this:

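from bs4 import BeautifulSoup

with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read(), "html.parser")
    xml2 = BeautifulSoup(data2.read(), "html.parser")

    # HTML parsers lowercase tag names, so search for "myid", not "myId"
    hasAttrBs = xml.find("myid")
    hasAttrBs2 = xml2.find("myid")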
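A self-contained sketch of the same check, with the two documents inlined as strings so it runs without any files. It assumes the html.parser backend, which lowercases tag names, so the lookup is done in lowercase. The names FILE1, FILE2 and has_tag are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Inlined copies of file1.xml and file2.xml from the question.
FILE1 = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"
FILE2 = "<document><subDoc><id>2</id></subDoc></document>"


def has_tag(markup, name):
    """Return True if a tag called `name` occurs anywhere in `markup`."""
    soup = BeautifulSoup(markup, "html.parser")
    # find() returns the first matching Tag, or None when nothing matches.
    # html.parser lowercases tag names, so the name is lowercased as well.
    return soup.find(name.lower()) is not None


print(has_tag(FILE1, "myId"))
print(has_tag(FILE2, "myId"))
```

Comparing against None makes the intent explicit: the result is either a Tag or None, and only None means the tag is missing.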

If you do know the structure, you can get the desired element by accessing the tag names as attributes, like this: xml.document.subdoc.myid. So the whole thing would go something like this:

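from bs4 import BeautifulSoup

with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read(), "html.parser")
    xml2 = BeautifulSoup(data2.read(), "html.parser")

    # Attribute access returns the first matching tag, or None if there is none
    hasAttrBs = xml.document.subdoc.myid
    hasAttrBs2 = xml2.document.subdoc.myid
    print(hasAttrBs)
    print(hasAttrBs2)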

Prints

<myid>1</myid>
None
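For contrast with the has_attr() and hasattr() calls in the question, here is a small sketch of what each check looks at. The markup is hypothetical: myid is written as an attribute of subdoc so that has_attr() has something to find, and html.parser is assumed as the backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: myid is an attribute here, and <id> is a child tag.
soup = BeautifulSoup('<subdoc myid="1"><id>1</id></subdoc>', "html.parser")
subdoc = soup.subdoc

# has_attr() only looks at the tag's own attributes, never at child tags.
attr_present = subdoc.has_attr("myid")
id_is_attr = subdoc.has_attr("id")

# find() looks for a tag below subdoc and returns None if there is none.
child_present = subdoc.find("id") is not None

# hasattr() is misleading on a Tag: an unknown name is treated as a tag
# lookup that returns None instead of raising AttributeError.
py_hasattr = hasattr(subdoc, "nosuchtag")

print(attr_present, id_is_attr, child_present, py_hasattr)
```

This is why hasAttrPy was True for both files in the question, while the None result of the tag lookup is the value that actually distinguishes them.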
wpercy answered Oct 02 '22 09:10


Here's an example that checks whether an h2 tag exists on an Instagram page. Hope you find it useful:

import requests
from bs4 import BeautifulSoup

instagram_url = 'https://www.instagram.com/p/BHijrYFgX2v/?taken-by=findingmero'
html_source = requests.get(instagram_url).text
soup = BeautifulSoup(html_source, "lxml")

if not soup.find('h2'):
    print("didn't find h2")
Mona Jalal answered Sep 29 '22 09:09


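You can handle it like this, calling has_child(xml.document.subdoc, 'myid') with the lowercase name that the HTML parsers produce:

def has_child(parent, name):
    for child in parent.children:
        # text nodes have no tag name, so fall back to None for them
        if getattr(child, 'name', None) == name:
            return True
    return False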
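The same direct-child check can be sketched without an explicit loop by passing recursive=False to find(), which restricts the search to the immediate children of a tag. The helper name has_direct_child and the inlined MARKUP string are assumptions for illustration, and html.parser is assumed as the backend, so tag names are lowercase.

```python
from bs4 import BeautifulSoup

# Inlined copy of file1.xml from the question.
MARKUP = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"


def has_direct_child(parent, name):
    """Return True if `parent` has an immediate child tag called `name`."""
    # recursive=False limits the search to direct children of `parent`.
    return parent.find(name, recursive=False) is not None


soup = BeautifulSoup(MARKUP, "html.parser")
subdoc = soup.find("subdoc")

print(has_direct_child(subdoc, "myid"))         # myid is a child of subdoc
print(has_direct_child(soup.document, "myid"))  # myid is a grandchild here
```

Only the children of one element are inspected, so the rest of the document is not searched.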
chyoo CHENG answered Sep 29 '22 09:09