HTML structure into network graph

What I'm trying to do is to represent the DOM (document object model) of an HTML site as a network graph, and then do some statistical computing on this graph (degree, betweenness, proximity, plotting of course, etc.). I couldn't find any library or previous SO post that does this directly. My idea was to use the BeautifulSoup library, then the Networkx library. I tried to write some code that loops through each element of the HTML structure (using recursive=True), but I don't know how to identify each tag uniquely: as you can see below, adding a second h1 node to the graph overwrites the first one, and the same happens for parents, so the resulting graph is completely wrong.

import networkx as nx
import bs4
from bs4 import BeautifulSoup
ex0 = "<html><head><title>Are you lost ?</title></head><body><h1>Lost on the Intenet ?</h1><h1>Don't panic, we will help you</h1><strong><pre>    * <----- you are here</pre></strong></body></html>"
soup = BeautifulSoup(ex0, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
G=nx.Graph()
for tag in soup.findAll(recursive=True):
    G.add_node(tag.name)
    G.add_edge(tag.name, tag.findParent().name)
nx.draw(G)   
G.nodes
#### NodeView(('html', '[document]', 'head', 'title', 'body', 'h1', 'strong', 'pre'))

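For context, the collapse seen above follows from how networkx identifies nodes: a node is its hashable key, so adding 'h1' twice refers to the same node. A minimal sketch of the difference (the integer keys and the `tag` attribute name are illustrative choices, not something from the post):

```python
import networkx as nx

# Keyed by tag name: the second add_node is a no-op.
G = nx.Graph()
G.add_node("h1")
G.add_node("h1")            # same hashable key, so no second node is created
G.add_edge("h1", "body")
print(G.number_of_nodes())  # 2 ("h1" and "body")

# Keyed by a unique value, with the tag name stored as a node attribute.
H = nx.Graph()
H.add_node(0, tag="h1")
H.add_node(1, tag="h1")
H.add_edge(0, "body")
H.add_edge(1, "body")
print(H.number_of_nodes())  # 3
```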

Any idea how this could be done (completely different approaches are welcome)? Thanks.

PS: the graph could be directed or not, I don't care.

agenis asked Dec 17 '18 14:12


1 Answer

You can recurse over the contents attribute of each BeautifulSoup object and suffix repeated tag names with a counter so that every element gets its own node. To display the labels, pass with_labels=True to nx.draw:

import networkx as nx
import matplotlib.pyplot as plt
from collections import defaultdict
from bs4 import BeautifulSoup as soup
ex0 = "<html><head><title>Are you lost ?</title></head><body><h1>Lost on the Intenet ?</h1><h1>Don't panic, we will help you</h1><strong><pre>    * <----- you are here</pre></strong></body></html>"
d = soup(ex0, 'html.parser')
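def _traverse_html(_d:soup, _graph:nx.Graph, _counter, _parent=None) -> None:
  for i in _d.contents:
     # text nodes have no tag name, so only real elements become graph nodes
     if getattr(i, 'name', None) is not None:
       # first occurrence keeps the bare tag name; repeats get a numeric suffix (h1, h1_1, ...)
       _name_count = _counter[i.name]
       _node = i.name if not _name_count else f'{i.name}_{_name_count}'
       _counter[i.name] += 1
       _graph.add_node(_node)
       if _parent is not None:
         _graph.add_edge(_parent, _node)
       # recurse with the unique label so children attach to the right parent
       _traverse_html(i, _graph, _counter, _node)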

_full_graph = nx.Graph()
_traverse_html(d, _full_graph, defaultdict(int))
nx.draw(_full_graph, with_labels = True)   
plt.show()

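If the goal is the statistics mentioned in the question, one option is to key each node on a running integer and keep the tag name as a node attribute, which sidesteps label collisions entirely. This is a sketch under assumptions rather than a definitive implementation: `dom_to_graph`, `visit` and the `tag` attribute are hypothetical names, and closeness centrality is used here as one possible reading of "proximity".

```python
from itertools import count

import networkx as nx
from bs4 import BeautifulSoup, Tag

HTML = ("<html><head><title>Are you lost ?</title></head><body>"
        "<h1>Lost on the Intenet ?</h1><h1>Don't panic, we will help you</h1>"
        "<strong><pre>    * <----- you are here</pre></strong></body></html>")


def dom_to_graph(markup):
    """Return an undirected graph with one node per HTML element (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    graph = nx.Graph()
    ids = count()  # running integer used as the unique node key

    def visit(element, parent_id=None):
        for child in element.children:
            if not isinstance(child, Tag):
                continue  # skip text nodes, comments, etc.
            node_id = next(ids)
            graph.add_node(node_id, tag=child.name)  # tag name kept as an attribute
            if parent_id is not None:
                graph.add_edge(parent_id, node_id)
            visit(child, node_id)

    visit(soup)
    return graph


G = dom_to_graph(HTML)
labels = nx.get_node_attributes(G, "tag")  # {node key: tag name}
degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

# One row per element: key, tag name, degree, betweenness, closeness.
for node_id, tag in labels.items():
    print(node_id, tag, degree[node_id],
          round(betweenness[node_id], 3), round(closeness[node_id], 3))
```

To plot tag names instead of integer keys, the `labels` dict can be passed along, as in nx.draw(G, labels=labels, with_labels=True). Swapping nx.Graph for nx.DiGraph in the helper would give parent-to-child edges if a directed graph is preferred.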

Ajax1234 answered Oct 06 '22 00:10