I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:
<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>
I want to convert it to a dict like this:
{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}
My initial attempt uses find_all('li', recursive=False). It returns the top-level items (Operating System and Programming Languages), but also their children.
How can I do it with BeautifulSoup?
Here's one way:
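def dictify(ul):
    """Recursively convert a <ul> element into a nested parent-child dict."""
    result = {}
    # recursive=False limits the search to direct <li> children of this <ul>.
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is its own label.
        key = next(li.stripped_strings)
        sub_ul = li.find("ul")
        if sub_ul:
            result[key] = dictify(sub_ul)
        else:
            result[key] = None
    return result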
Example use:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
... <li>Operating System
... <ul>
... <li>Linux
... <ul>
... <li>Debian</li>
... <li>Fedora</li>
... <li>Ubuntu</li>
... </ul>
... </li>
... <li>Windows</li>
... <li>OS X</li>
... </ul>
... </li>
... <li>Programming Languages
... <ul>
... <li>Python</li>
... <li>C#</li>
... <li>Ruby</li>
... </ul>
... </li>
... </ul>
... """, "html.parser")
>>> ul = soup.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{u'Operating System': {u'Linux': {u'Debian': None,
                                  u'Fedora': None,
                                  u'Ubuntu': None},
                       u'OS X': None,
                       u'Windows': None},
 u'Programming Languages': {u'C#': None,
                            u'Python': None,
                            u'Ruby': None}}
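For Python 3, a slightly more defensive variant might look like the sketch below. It is not part of the answer above: the name dictify_strict and the HTML constant are made up for illustration, and it assumes the bs4 package is installed. It differs in two small ways: it only follows a <ul> that sits directly inside each <li>, and it falls back to an empty key instead of raising StopIteration when an <li> has no text.

```python
from bs4 import BeautifulSoup

# Same list as in the question, kept here so the sketch runs on its own.
HTML = """
<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>
"""


def dictify_strict(ul):
    """Hypothetical variant: build a nested dict from a <ul> element."""
    result = {}
    # Only the direct <li> children of this <ul> are visited.
    for li in ul.find_all("li", recursive=False):
        # The "" default avoids StopIteration for an <li> with no text at all.
        key = next(li.stripped_strings, "")
        # Only a <ul> that is a direct child of this <li> counts as its sublist.
        child_ul = li.find("ul", recursive=False)
        result[key] = dictify_strict(child_ul) if child_ul is not None else None
    return result


# html.parser ships with Python and does not add <html>/<body> wrappers,
# so the outer list is looked up with soup.find("ul").
soup = BeautifulSoup(HTML, "html.parser")
tree = dictify_strict(soup.find("ul"))
```

On Python 3.7 and later the resulting dict keeps the items in document order, so tree can be compared directly against the dict shown in the question.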