
How to detect navigation (menus) on a web page

Tags: python, html, xhtml

So I'm writing a program that opens a web page, and one of the things it should do is detect how many navigations (menus) the page has, how long the main navigation is (how many elements), the average amount of text per navigation element, and so on.

Anyway, I have some problems detecting menus. I think there are two ways web navigation is usually coded:

1. <ul><li><a>Home</a></li><li><a>Products</a></li>...</ul>
2. <div><a>Home</a><a>Products</a>...</div>

So if I find one of these structures I know (or should I say "I think") it's navigation. But this is NOT bulletproof: I get a lot of false hits.

So does anyone have a better idea of how to detect navigation on web pages?

karantan asked Feb 23 '23 03:02


2 Answers

There is no universal solution; you need to implement some heuristics. I would try something like this:

  1. Fetch the site's pages with a recursion limit of 1 (like wget -r -l1 http://example.com/).
  2. For each internal page, keep the set of internal links found on that page.
  3. Take the intersection of all the sets.

This way you get the set of internal links that stays constant across pages, which in most cases will be the "menu" of the site.
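The three steps above could be sketched roughly like this in Python. It is only a sketch under a few assumptions: the pages have already been downloaded (for example with the wget command above) and are passed in as a dict of URL to HTML, "internal" is taken to mean "same host as the page", and the names LinkCollector, internal_links and common_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return the set of absolute links in html that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return links


def common_links(pages):
    """Intersect the internal-link sets of all pages (dict of url -> html)."""
    link_sets = [internal_links(url, html) for url, html in pages.items()]
    return set.intersection(*link_sets) if link_sets else set()
```

Links that appear on every page are not only menus (footers and logo links survive the intersection too), so the result is a starting point for further filtering rather than a final answer.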

Michał Šrajer answered Mar 08 '23 07:03


In HTML4 and XHTML there is no standard way of writing menus. In HTML5 you have the <menu> and <nav> tags, but as you have concluded, in earlier versions the generally recommended way is to use an unordered list.

I would probably write a number of tests, and use them all in parallel to try and find the menu, e.g. based on position in the document, structure, and things like id and class attributes (the values of which will often contain "menu").
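As a rough illustration of combining several such tests, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for the example: the set of container tags, the "menu"/"nav" substring hints, the point weights, and the names MenuCandidateFinder, score and rank_menu_candidates. Real pages will need the weights and tests tuned.

```python
from html.parser import HTMLParser

CONTAINER_TAGS = {"nav", "menu", "ul", "ol", "div"}  # assumed candidate elements
NAME_HINTS = ("menu", "nav")                         # assumed id/class hints


class MenuCandidateFinder(HTMLParser):
    """Record every container element and how many links it contains."""

    def __init__(self):
        super().__init__()
        self.open_containers = []  # stack of containers not yet closed
        self.candidates = []       # all containers, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in CONTAINER_TAGS:
            names = (attrs.get("id") or "") + " " + (attrs.get("class") or "")
            candidate = {
                "tag": tag,
                "names": names.lower(),
                "links": 0,
                "position": len(self.candidates),
            }
            self.open_containers.append(candidate)
            self.candidates.append(candidate)
        elif tag == "a" and attrs.get("href"):
            # A link counts for every container it is nested in.
            for candidate in self.open_containers:
                candidate["links"] += 1

    def handle_endtag(self, tag):
        # Close the innermost open container with this tag name, if any.
        for i in range(len(self.open_containers) - 1, -1, -1):
            if self.open_containers[i]["tag"] == tag:
                del self.open_containers[i:]
                break


def score(candidate):
    """Add up points from the individual tests (weights are arbitrary)."""
    points = 0
    if candidate["tag"] in ("nav", "menu"):                         # structure test
        points += 2
    if any(hint in candidate["names"] for hint in NAME_HINTS):      # id/class test
        points += 2
    if candidate["links"] >= 2:                                     # link-count test
        points += 1
    return points


def rank_menu_candidates(html):
    """Return containers with at least two links, best score first.

    Ties are broken by position, on the assumption that menus tend to
    appear early in the document.
    """
    finder = MenuCandidateFinder()
    finder.feed(html)
    with_links = [c for c in finder.candidates if c["links"] >= 2]
    return sorted(with_links, key=lambda c: (-score(c), c["position"]))
```

Calling rank_menu_candidates(html)[0] then gives the most menu-like container under these assumptions; its "links" count is one way to get at the "how many elements" figure the question asks about.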

richardolsson answered Mar 08 '23 07:03