
How to grab all headers from a website using BeautifulSoup?

I'm trying to grab all the headers from a simple website. My attempt:

from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
soup.find_all('h')  # returns []

soup.find_all('h') returns [], but if I do something like soup.h1 or soup.h2, it returns that respective data. Am I just calling the method incorrectly?

hiimarksman asked Jul 12 '17 15:07


1 Answer

Filter by regular expression:

import re

soup.find_all(re.compile(r'^h[1-6]$'))

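This regex matches tag names that consist of h followed by exactly one digit from 1 to 6, so it finds h1 through h6 and skips tags such as header or hr. soup.find_all('h') returns [] because a string argument is compared against the whole tag name, and HTML has no tag named h.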
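
For a quick check without hitting the network, here is a minimal sketch that runs the same filter against a small inline HTML snippet. The snippet and the variable names (html, by_regex, by_list, headings) are made up for illustration and are not from the question. It also shows an alternative that should behave the same way: passing a list of tag names to find_all.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """
<html><body>
<header>Site header</header>
<h1>Business</h1>
<h2>Markets</h2>
<p>Not a heading</p>
<h3>Tech</h3>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole tag name, so 'h' matches nothing.
no_match = soup.find_all("h")

# A compiled regex is matched against each tag name.
by_regex = soup.find_all(re.compile(r"^h[1-6]$"))

# Alternative: a list of names matches any tag whose name is in the list.
by_list = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

# Collect (tag name, text) pairs in document order.
headings = [(tag.name, tag.get_text(strip=True)) for tag in by_regex]
print(headings)
```

The list form is easier to read if you only need a few heading levels; the regex is shorter when you want all six.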

phd answered Oct 24 '22 04:10