Parsing nested HTML list with BeautifulSoup

I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:

<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>

I want to convert it to a dict like this:

{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}

My initial attempt uses find_all('li', recursive=False). It returns the top-level items (Operating System and Programming Languages), but their nested children come along with them.

How can I do it with BeautifulSoup?

asked Jul 25 '13 05:07 by flowfree


1 Answer

Here's one way:

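def dictify(ul):
    """Convert a <ul> tag into a nested dict keyed by each item's label."""
    result = {}
    # recursive=False restricts the search to direct <li> children.
    for li in ul.find_all("li", recursive=False):
        # The first stripped string is the item's own label.
        key = next(li.stripped_strings)
        # Use a separate name so the ul parameter is not shadowed.
        nested = li.find("ul")
        if nested:
            result[key] = dictify(nested)
        else:
            result[key] = None
    return result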
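
As a rough illustration of why the children seemed to show up in the question's attempt, here is a small sketch on a cut-down list (the sample HTML and the variable names are made up for this demo). recursive=False does limit find_all to the direct li children, but the text of each li still contains everything nested inside it, which is why the function takes only the first stripped string as the key:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li>Linux
    <ul>
      <li>Debian</li>
      <li>Fedora</li>
    </ul>
  </li>
  <li>Windows</li>
</ul>
"""

# html.parser ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")
top = soup.ul

# Only the two direct <li> children are returned...
items = top.find_all("li", recursive=False)
print(len(items))  # 2

# ...but the text of an item still includes everything nested inside it.
print(items[0].get_text(" ", strip=True))  # Linux Debian Fedora

# The first stripped string is just the item's own label.
labels = [next(li.stripped_strings) for li in items]
print(labels)  # ['Linux', 'Windows']
```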

Example use:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
...   <li>Operating System
...     <ul>
...       <li>Linux
...         <ul>
...           <li>Debian</li>
...           <li>Fedora</li>
...           <li>Ubuntu</li>
...         </ul>
...       </li>
...       <li>Windows</li>
...       <li>OS X</li>
...     </ul>
...   </li>
...   <li>Programming Languages
...     <ul>
...       <li>Python</li>
...       <li>C#</li>
...       <li>Ruby</li>
...     </ul>
...   </li>
... </ul>
... """, "html.parser")
>>> ul = soup.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{'Operating System': {'Linux': {'Debian': None,
                                'Fedora': None,
                                'Ubuntu': None},
                      'OS X': None,
                      'Windows': None},
 'Programming Languages': {'C#': None,
                           'Python': None,
                           'Ruby': None}}
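
If the real markup is messier than the sample, a slightly more defensive variant may help. The sketch below is only one possible take, and the name dictify_strict is hypothetical: it skips li elements that contain no text at all (where next() would otherwise raise StopIteration) and only follows a ul that is a direct child of the li:

```python
from bs4 import BeautifulSoup


def dictify_strict(ul):
    """Hypothetical variant of dictify: skips empty items and only
    follows a <ul> that is a direct child of the <li>."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        # A default of None avoids StopIteration on an empty <li>.
        key = next(li.stripped_strings, None)
        if key is None:
            continue
        child = li.find("ul", recursive=False)
        result[key] = dictify_strict(child) if child is not None else None
    return result


soup = BeautifulSoup(
    "<ul><li>A<ul><li>B</li><li></li></ul></li><li>C</li></ul>",
    "html.parser",
)
print(dictify_strict(soup.ul))  # {'A': {'B': None}, 'C': None}
```

Whether silently dropping empty items is the right behaviour depends on the data, so treat that part as an assumption.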

answered Nov 12 '22 00:11 by Zero Piraeus