I've been trying to capture the layout of an HTML page using BeautifulSoup4 and recursion. The idea is to build a linked data structure from parents to children. For example, a layout like this:
<html>
    <h1>
        <!--Contents-->
    </h1>
    <div>
        <div>
            <!--Contents-->
        </div>
    </div>
</html>
would be stored in a list like so:
html = [h1, div]  # where h1 and div are themselves lists, which in turn contain lists
I've had a hard time finding Q&A on this specific problem, so I've tried modeling a function on the way recursion is used to traverse directories, since the two problems are pretty similar.
This is my current function in Python 3 that's supposed to nest tags into lists:
def listGen(l, hObj):
    # Where hObj is a BS4 object and l is a sorted list of the direct children of the html tag
    z = []
    for x in l:
        z.append(hObj.find(x).children)

    def expand(xlist1):
        # Where xlist1 is a list generator
        for n in xlist1:
            if n.name is not None:
                print(n.name)
                for n2 in hObj.find(n.name).children:
                    if n2.name is not None:
                        print(n2.name, "--")  # Debugging print
        return z  # Temporary

    for x in z:
        print("------")
        expand(x)
    return z
Parsing the Wikipedia Home Page gives me an output of:
------
h1
img --
div --
div
div --
strong --
div
div --
strong --
div
div --
strong --
div
div --
strong --
div
div --
strong --
hr
div
div --
strong --
p
small --
small --
small --
script
script
script
style
------
meta
title
meta
script
meta
link
link
link
style
style
link
link
This is exactly what I need. However, it takes two for loops, and it would take many more to get all the children of children. Furthermore, I can't know in advance how deeply tags will be nested on future websites. So I changed the expand function to:
def expand(xlist1, depth):
    l1 = list(xlist1)
    if depth < len(l1):
        for n in l1[depth]:
            if n is not None:
                if hObj.find(l1[depth].name).children:
                    return expand(hObj.find(l1[depth].name).children, 0)
            if n is None:
                print(2)  # Debugging print
                return expand(xlist1, depth + 1)
    if depth >= len(l1):
        return 0  # Temporary
    return 0  # Temporary
This only gives me maximum recursion depth errors, and the many other variations I've tried haven't worked either.
I've scoured the BS4 docs multiple times and there seems to be no built-in function for this. Any suggestions, or is this not a viable way to achieve what I'm looking for?
I don't think that nested lists are exactly what you're looking for here. If all you're trying to do is build a tree of tags, I would use nested dictionaries. They are still a good fit if you later want to extract other information about each tag.
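To make the difference concrete, here is a small hand-written sketch of the question's sample layout in both shapes. The variable names are only illustrative, and nothing here is produced by BeautifulSoup.

```python
# Hand-written stand-ins for the sample layout in the question
# (html > h1, html > div > div). Names are illustrative only.

# Nested lists: each list is a tag, each item is a child tag.
# The structure survives, but the tag names do not.
html_as_lists = [[], [[]]]  # [h1, div], where div = [inner div]

# Nested dictionaries: every node keeps its own name next to its children.
html_as_dicts = {
    "name": "html",
    "children": [
        {"name": "h1", "children": []},
        {"name": "div", "children": [{"name": "div", "children": []}]},
    ],
}

# With lists, the first child is just an empty list.
print(html_as_lists[0])                      # []
# With dictionaries, the same position still says which tag it is.
print(html_as_dicts["children"][0]["name"])  # h1
```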
This recursive function will build a nested-dictionary "tree":
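def traverse(soup):
    # Nodes without a tag name (such as text and comments) are skipped.
    if soup.name is not None:
        dom_dictionary = {}
        dom_dictionary['name'] = soup.name
        # Recurse into child tags only; children without a name are filtered out.
        dom_dictionary['children'] = [traverse(child) for child in soup.children if child.name is not None]
        return dom_dictionary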
We can use it like so:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.text, "html5lib")
traverse(soup)
This gives us:
{'name': '[document]',
 'children': [{'name': 'html',
   'children': [{'name': 'head',
     'children': [{'name': 'title', 'children': []},
      {'name': 'meta', 'children': []},
      {'name': 'meta', 'children': []},
      {'name': 'meta', 'children': []},
      {'name': 'style', 'children': []}]},
    {'name': 'body',
     'children': [{'name': 'div',
       'children': [{'name': 'h1', 'children': []},
        {'name': 'p', 'children': []},
        {'name': 'p', 'children': [{'name': 'a', 'children': []}]}]}]}]}]}
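If the nested lists from the question are still wanted, they can be derived from this dictionary afterwards. The following is a minimal sketch, assuming a tree with the same 'name' and 'children' keys as the output above. `to_nested_lists` and `outline` are hypothetical helper names, not part of BeautifulSoup, and the sample `tree` is typed by hand to mirror the layout in the question.

```python
def to_nested_lists(node):
    """Drop the names and keep only the shape: each tag becomes a list of its children."""
    return [to_nested_lists(child) for child in node["children"]]


def outline(node, depth=0):
    """Return one line per tag, indented two spaces per nesting level."""
    lines = ["  " * depth + node["name"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


# Hand-written tree in the same shape that traverse() returns.
tree = {
    "name": "html",
    "children": [
        {"name": "h1", "children": []},
        {"name": "div", "children": [{"name": "div", "children": []}]},
    ],
}

print(to_nested_lists(tree))  # [[], [[]]]
print("\n".join(outline(tree)))
```

Both helpers recurse once per nesting level, so no extra loop has to be written for each additional level of tags.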