 

Using BeautifulSoup 4 and recursion to capture the structure of HTML nested tags

I've been trying to capture the layout of an HTML page using BeautifulSoup4 and recursion. The idea is to have linked data structures of parents to children, for example, a layout like this:

<html>
 <h1>
  <!--Contents-->
 </h1>
 <div> 
  <div> 
   <!--Contents-->
  </div>
 </div>
</html>

Would be stored in a list like so:

html = [h1, div]  # where h1 and div are themselves lists that may contain further lists
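(For the sample layout above, the target structure built by hand would look something like this; the variable names are just illustrative:)

```python
# Hand-built nested lists mirroring the sample HTML above.
# Each tag is represented by a list of its child tags;
# comments/text contents are not stored.
inner_div = []           # innermost <div>, no child tags
outer_div = [inner_div]  # <div> wrapping the inner <div>
h1 = []                  # <h1>, no child tags
html = [h1, outer_div]   # <html> -> [h1, div]

print(html)  # [[], [[]]]
```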

I've had a hard time finding Q&As on this specific problem, so I've instead modeled a function on recursive directory traversal, since the two problems are quite similar.

This is my current function in Python 3 that's supposed to nest tags into lists:

def listGen(l, hObj):
    # l is a sorted list of the direct children of the html tag;
    # hObj is a BS4 object
    z = []
    for x in l:
        z.append(hObj.find(x).children)

    def expand(xlist1):
        # xlist1 is a generator over a tag's children
        for n in xlist1:
            if n.name is not None:
                print(n.name)
                for n2 in hObj.find(n.name).children:
                    if n2.name is not None:
                        print(n2.name, "--")  # Debugging print
        return z  # Temporary

    for x in z:
        print("------")
        expand(x)
    return z

Parsing the Wikipedia Home Page gives me an output of:

------
h1
img --
div --
div
div --
strong --
div
div --
strong --
div
div --
strong --
div
div --
strong --
div
div --
strong --
hr
div
div --
strong --
p
small --
small --
small --
script
script
script
style
------
meta
title
meta
script
meta
link
link
link
style
style
link
link

This is exactly what I need; however, it takes two for loops, and it would take many more to get all the children of children. Furthermore, I wouldn't know in advance how deeply tags nest on future websites. So I changed the expand function to:

def expand(xlist1, depth):
    l1 = list(xlist1)
    if depth < len(l1):
        for n in l1[depth]:
            if n is not None:
                if hObj.find(l1[depth].name).children:
                    return expand(hObj.find(l1[depth].name).children, 0)
            if n is None:
                print(2)  # Debugging print
                return expand(xlist1, depth + 1)
    if depth >= len(l1):
        return 0  # Temporary
    return 0  # Temporary

This only gives me maximum recursion depth errors. I've tried many other variations of it, all to no avail.
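(Note on the recursion errors: hObj.find(n.name) always searches from the document root, so the recursion keeps revisiting the same tags instead of descending. Recursing on each tag's own .children avoids the cycle; a minimal sketch of that idea, with build_tree as an illustrative name:)

```python
from bs4 import BeautifulSoup

def build_tree(tag):
    # Recurse on this tag's own children rather than re-searching
    # the whole document with find(), which is what causes the cycle.
    return [build_tree(child) for child in tag.children
            if child.name is not None]

soup = BeautifulSoup("<html><h1></h1><div><div></div></div></html>",
                     "html.parser")
print(build_tree(soup.html))  # [[], [[]]]
```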

I've scoured the BS4 docs multiple times and there seems to be no built-in function for this. Any suggestions, or is this not a viable way to achieve what I'm looking for?

GKE asked Oct 17 '18


1 Answer

I don't think that nested lists are exactly what you're looking for here. If all you're trying to do is build a tree of tags, I would use nested dictionaries; I would still use them if you later want to store other information about each tag.

This recursive function will build a nested-dictionary "tree":

def traverse(soup):
    if soup.name is not None:
        dom_dictionary = {}
        dom_dictionary['name'] = soup.name
        dom_dictionary['children'] = [ traverse(child) for child in soup.children if child.name is not None]
        return dom_dictionary

We can use it like so:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html5lib")
traverse(soup)

This gives us:

{'name': '[document]',
 'children': [{'name': 'html',
   'children': [{'name': 'head',
     'children': [{'name': 'title', 'children': []},
      {'name': 'meta', 'children': []},
      {'name': 'meta', 'children': []},
      {'name': 'meta', 'children': []},
      {'name': 'style', 'children': []}]},
    {'name': 'body',
     'children': [{'name': 'div',
       'children': [{'name': 'h1', 'children': []},
        {'name': 'p', 'children': []},
        {'name': 'p', 'children': [{'name': 'a', 'children': []}]}]}]}]}]}
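If you want to eyeball the tree this produces, a small helper (print_tree is just an illustrative name) can render it with indentation. This sketch parses a literal string so it runs without a network request:

```python
from bs4 import BeautifulSoup

def traverse(soup):
    if soup.name is not None:
        return {"name": soup.name,
                "children": [traverse(c) for c in soup.children
                             if c.name is not None]}

def print_tree(node, depth=0):
    # Indent each tag name by its depth in the tree.
    print("  " * depth + node["name"])
    for child in node["children"]:
        print_tree(child, depth + 1)

soup = BeautifulSoup("<div><h1></h1><p><a></a></p></div>", "html.parser")
print_tree(traverse(soup))
```

Each tag prints on its own line, indented two spaces per level of nesting, starting from the synthetic `[document]` root.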
Noah B. Johnson answered Oct 23 '22