 

Navigation with BeautifulSoup

I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.

import requests
from bs4 import BeautifulSoup

url = 'http://examplewebsite.com'
source = requests.get(url)
soup = BeautifulSoup(source.content, "html.parser")

# Now I navigate the soup
for a in soup.find_all('a'):
    print(a.get("href"))
  1. Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.

  2. The href links I want are all in a certain location within the webpage, inside a particular element (e.g. a div with a certain class). Can I access only these links?

  3. How can I scrape the contents behind each href link and save them to a file?

asked Oct 29 '15 by ShanZhengYang

People also ask

Can BeautifulSoup use XPath?

No, BeautifulSoup by itself does not support XPath expressions.
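
If you do need XPath, the lxml library supports it directly. A minimal sketch, assuming lxml is installed and using a placeholder URL:

import requests
from lxml import html

# Parse the raw page bytes with lxml, which has full XPath support
tree = html.fromstring(requests.get('http://examplewebsite.com').content)  # placeholder URL

# Select the href attribute of every anchor element via XPath
for href in tree.xpath('//a/@href'):
    print(href)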

What is BeautifulSoup used for?

Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
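
For instance, a minimal sketch of pulling data out of a small HTML snippet (the markup here is invented for illustration):

from bs4 import BeautifulSoup

html_doc = "<html><body><p class='title'>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# The parse tree can be navigated hierarchically
print(soup.body.p["class"])       # ['title']
print(soup.find("p").get_text())  # Hello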

Is Scrapy faster than BeautifulSoup?

Performance. Thanks to its built-in support for selecting and extracting data from various sources and for generating feed exports in multiple formats, Scrapy is generally faster than Beautiful Soup. Work with Beautiful Soup can be sped up with multithreading.
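
As a rough sketch of that multithreading idea, here pages are fetched concurrently with Python's standard concurrent.futures (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

urls = ["http://examplewebsite.com/page1",  # placeholder URLs
        "http://examplewebsite.com/page2"]

def fetch_title(url):
    # Each worker thread downloads and parses one page
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else url

# Threads overlap the network waits, which dominate scraping time
with ThreadPoolExecutor(max_workers=5) as executor:
    for title in executor.map(fetch_title, urls):
        print(title)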




1 Answer

With BeautifulSoup, that's all doable and simple.

(1) Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.

Say all the links you need have price in their text - you can use the text argument:

soup.find_all("a", text="price")  # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text)  # 'price' is inside the text

You may use functions and many other kinds of objects to filter elements - for example, compiled regular expressions:

import re

soup.find_all("a", text=re.compile(r"^[pP]rice"))

If price is somewhere in the "href" attribute, you can use one of the following CSS selectors:

soup.select("a[href*=price]")  # href contains 'price'
soup.select("a[href^=price]")  # href starts with 'price'
soup.select("a[href$=price]")  # href ends with 'price'

or, via find_all():

soup.find_all("a", href=lambda href: href and "price" in href)

(2) The href links I want are all in a certain location within the webpage, inside a particular element (e.g. a div with a certain class). Can I access only these links?

Sure, locate the appropriate container and call find_all() or other searching methods:

container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])

Or, you may write a CSS selector that searches for links inside a specific element with the desired attribute or attribute values. For example, here we are searching for a elements having an href attribute, located inside a div element having the container class:

soup.select("div.container a[href]")

(3) How can I scrape the contents behind each href link and save them to a file?

If I understand correctly, you need to get the appropriate links, follow them, and save the source code of the pages locally as HTML files. There are multiple options depending on your requirements (for instance, speed may be critical, or it may be a one-time task where you don't care about performance).

If you stay with requests, the code will be blocking in nature - you'll extract a link, follow it, save the page source, and then proceed to the next one. The main downside is that it will be slow (depending, for starters, on how many links there are). Sample code to get you going:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")

    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)

        # write the raw page bytes to an HTML file named after the link text
        # (note: this assumes the link text makes a valid filename)
        with open(title + ".html", "wb") as f:
            f.write(session.get(full_link).content)

If speed is critical, you may look into grequests or Scrapy to solve that part.
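
For instance, a minimal sketch of the grequests approach, assuming the links have already been collected (the URLs are placeholders):

import grequests

urls = ["http://examplewebsite.com/page1",  # placeholder URLs
        "http://examplewebsite.com/page2"]

# Build the requests lazily; grequests.map() sends them concurrently via gevent
pending = (grequests.get(u) for u in urls)
for response in grequests.map(pending, size=5):
    if response is not None:  # failed requests come back as None
        print(response.url, len(response.content))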

answered Oct 10 '22 by alecxe