
Get all HTML tags with Beautiful Soup

I am trying to get a list of all HTML tags in a document using Beautiful Soup.

I see find_all(), but it seems I have to know the name of the tag before I search.

If there is text like

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like

list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but I am trying to learn BS4.

humanbeing asked Mar 19 '16 23:03



3 Answers

You don't have to pass any arguments to find_all(). Called with no arguments, BeautifulSoup returns every tag in the tree, recursively.

Sample:

from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")

print([tag.name for tag in soup.find_all()])
# ['div', 'div', 'div', 'p']

print([str(tag) for tag in soup.find_all()])
# ['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
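
Neither output above is exactly the list of opening tags the question asks for. One possible way to get that, as a rough sketch only, is to rebuild each opening tag from tag.name and tag.attrs. The helper name opening_tag is hypothetical (it is not part of bs4), and it assumes attribute values contain no single quotes, since it does no escaping.

```python
from bs4 import BeautifulSoup

def opening_tag(tag):
    """Rebuild an opening tag string from a bs4 Tag (hypothetical helper, no escaping)."""
    parts = [tag.name]
    for key, value in tag.attrs.items():
        # Multi-valued attributes such as class are stored as lists of strings.
        if isinstance(value, list):
            value = " ".join(value)
        parts.append(f"{key}='{value}'")
    return "<" + " ".join(parts) + ">"

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")

list_of_tags = [opening_tag(tag) for tag in soup.find_all()]
print(list_of_tags)
# ['<div>', '<div>', "<div class='magical'>", '<p>']
```

Note that the original quoting style of the source HTML is not preserved by the parser, so the sketch simply picks single quotes to match the list in the question.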

alecxe answered Oct 18 '22 20:10


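Passing True to find_all() also matches every tag (findAll() is the older camelCase alias of the same method):

# soup is a BeautifulSoup object, e.g. BeautifulSoup(html, "html.parser")
for tag in soup.find_all(True):
    print(tag.name)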
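
To illustrate what "every tag" means once the markup is nested, here is a small sketch. The nested HTML string is made up for this example; recursive=False is a real find_all() argument that limits the search to direct children.

```python
from bs4 import BeautifulSoup

# Made-up nested markup, just for illustration.
nested_html = "<div><p>a <b>b</b></p></div><span>c</span>"
nested_soup = BeautifulSoup(nested_html, "html.parser")

# True matches any tag; descendants are visited in document order.
all_names = [tag.name for tag in nested_soup.find_all(True)]
print(all_names)
# ['div', 'p', 'b', 'span']

# recursive=False only looks at the direct children of the soup object.
top_level_names = [tag.name for tag in nested_soup.find_all(True, recursive=False)]
print(top_level_names)
# ['div', 'span']
```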

Anjan answered Oct 18 '22 19:10


I thought I'd share my solution to a very similar question for those who find themselves here later.

Example

I needed to find all tags quickly but only wanted unique values. I'll use the Python calendar module to demonstrate.

We'll generate an HTML calendar, then parse it and collect only the unique tag names present.

The structure below is very similar to the answers above, but passes a generator expression to set() so that duplicates are dropped:

from bs4 import BeautifulSoup
import calendar

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
set(tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all())

# Result
# {'table', 'td', 'th', 'tr'}
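
If you also want to know how often each tag occurs, a possible variation (a sketch, not part of the original answer) is to feed the same generator expression to collections.Counter from the standard library instead of set().

```python
from collections import Counter
import calendar

from bs4 import BeautifulSoup

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
cal_soup = BeautifulSoup(html_cal, "html.parser")

# Counter maps each distinct tag name to the number of times it appears.
tag_counts = Counter(tag.name for tag in cal_soup.find_all())

for name, count in tag_counts.most_common():
    print(name, count)
```

The keys of tag_counts are the same unique names that the set() version returns.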

Jason R Stevens CFA answered Oct 18 '22 18:10