I'm looking to create a dictionary in Python where the key is the HTML tag name and the value is the number of times the tag appears. Is there a way to do this with Beautiful Soup or something else?
BeautifulSoup is really good for HTML parsing, and you could certainly use it for this purpose. It would be extremely simple:
from bs4 import BeautifulSoup as BS
def num_appearances_of_tag(tag_name, html):
    # Name the parser explicitly; "html.parser" ships with Python
    soup = BS(html, "html.parser")
    return len(soup.find_all(tag_name))
With BeautifulSoup you can search for all tags by omitting the search criteria:
# print all tags
for tag in soup.find_all():
    print(tag.name)  # TODO: add/update dict
If you're only interested in the number of occurrences, BeautifulSoup may be overkill, in which case you could use the standard library's HTMLParser instead:
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
class print_tags(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(tag)  # TODO: add/update dict
parser = print_tags()
parser.feed(html)
This will produce the same output.
To create the dictionary of { 'tag' : count }, you could use collections.defaultdict:
from collections import defaultdict
occurrences = defaultdict(int)
# ...
occurrences[tag_name] += 1
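Putting the pieces together, here is a minimal sketch that fills in the TODO from the HTMLParser example using a defaultdict. It assumes Python 3 and uses only the standard library; the names TagCounter, count_tags and sample_html are made up for illustration and are not part of any library.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag appears (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.occurrences = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; HTMLParser lowercases tag names.
        # Self-closing tags such as <br/> also end up here by default.
        self.occurrences[tag] += 1


def count_tags(html):
    """Return a plain dict of {tag_name: count} for the given HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.occurrences)


# Illustrative input, not taken from the question.
sample_html = "<html><body><p>one</p><p>two</p><br/><a href='#'>x</a></body></html>"
counts = count_tags(sample_html)
print(counts)
```

The same defaultdict update should also work inside the BeautifulSoup loop: replace the print with occurrences[tag.name] += 1.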