I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.
import requests
from bs4 import BeautifulSoup
url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(content, "html.parser")
# Now I navigate the soup
for a in soup.find_all('a'):
    print(a.get("href"))
Is there a way to find only particular href by the labels? For example, all the href's I want are called by a certain name, e.g. price in an online catalog.
The href links I want are all in a certain location within the webpage, inside a particular element on the page with a certain class. Can I access only these links?
How can I scrape the contents within each href link and save into a file format?
With BeautifulSoup, that's all doable and simple.
(1) Is there a way to find only particular href by the labels? For example, all the href's I want are called by a certain name, e.g. price in an online catalog.
Say, all the links you need have price in the text - you can use a text argument:
soup.find_all("a", text="price") # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text) # 'price' is inside the text
Yes, you may use functions and many other kinds of objects to filter elements, for example, compiled regular expressions:
import re
soup.find_all("a", text=re.compile(r"^[pP]rice"))
If price is somewhere in the "href" attribute, you can use the following CSS selectors:
soup.select("a[href*=price]") # href contains 'price'
soup.select("a[href^=price]") # href starts with 'price'
soup.select("a[href$=price]") # href ends with 'price'
or, via find_all():
soup.find_all("a", href=lambda href: href and "price" in href)
(2) The href links I want are all in a certain location within the webpage, inside a particular element on the page with a certain class. Can I access only these links?
Sure, locate the appropriate container and call find_all() or other searching methods:
container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])
Or, you may write your CSS selector so that it searches for links inside a specific element having the desired attribute or attribute values. For example, here we are searching for a elements that have href attributes and are located inside a div element with the container class:
soup.select("div.container a[href]")
(3) How can I scrape the contents within each href link and save into a file format?
If I understand correctly, you need to get the appropriate links, follow them and save the source code of the pages locally into HTML files. There are multiple options to choose from depending on your requirements (for instance, speed may be critical, or it may just be a one-time task where you don't care about performance).
If you stay with requests, the code will be of a blocking nature - you'll extract a link, follow it, save the page source and then proceed to the next one - the main downside is that it would be slow (depending on, for starters, how many links there are). Sample code to get you going:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'

with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")

    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)  # you may want to sanitize this before using it as a filename

        with open(title + ".html", "wb") as f:  # binary mode, since .content is bytes
            f.write(session.get(full_link).content)
You may look into grequests or Scrapy to solve that part.
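For instance, a minimal sketch of the concurrent approach with grequests might look like this (assuming grequests is installed; the links list and the filename logic are purely illustrative, and in practice the links would be collected with BeautifulSoup as above):
import grequests

links = ["http://examplewebsite.com/price1.html", "http://examplewebsite.com/price2.html"]  # collected as above

pending = (grequests.get(link) for link in links)
for response in grequests.map(pending):  # sends the requests concurrently
    if response is None:  # a request failed
        continue
    # derive an illustrative filename from the last path segment of the URL
    name = response.url.rstrip("/").split("/")[-1] or "index"
    with open(name + ".html", "wb") as f:
        f.write(response.content)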