 

Navigation with BeautifulSoup

I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.

import requests
from bs4 import BeautifulSoup

url = 'http://examplewebsite.com'
source = requests.get(url)
soup = BeautifulSoup(source.content, "html.parser")

# Now I navigate the soup
for a in soup.find_all('a'):
    print(a.get("href"))
  1. Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.

  2. The href links I want are all in a certain location within the webpage, inside a particular element (e.g. a div with a certain class). Can I access only these links?

  3. How can I scrape the contents behind each href link and save them to a file?

asked Oct 29 '15 by ShanZhengYang

People also ask

Can BeautifulSoup use XPath?

No, BeautifulSoup by itself does not support XPath expressions.
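
If you do need XPath, the lxml library supports it directly. A minimal sketch, assuming lxml is installed and using a placeholder URL:

import requests
from lxml import html

# Parse the raw page bytes with lxml, which has full XPath support
tree = html.fromstring(requests.get('http://examplewebsite.com').content)  # placeholder URL

# Select the href attribute of every anchor element via XPath
for href in tree.xpath('//a/@href'):
    print(href)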

What is BeautifulSoup used for?

Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
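
For instance, a minimal sketch of pulling data out of a small HTML snippet (the markup here is invented for illustration):

from bs4 import BeautifulSoup

html_doc = "<html><body><p class='title'>Hello</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# The parse tree can be navigated hierarchically
print(soup.body.p["class"])       # ['title']
print(soup.find("p").get_text())  # Hello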

Is Scrapy faster than BeautifulSoup?

Performance. Thanks to its built-in support for selecting and extracting data from various sources and for generating feed exports in multiple formats, Scrapy is generally faster than Beautiful Soup. Work with Beautiful Soup can be sped up with multithreading.
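
As a rough sketch of that multithreading idea, here pages are fetched concurrently with Python's standard concurrent.futures (the URLs are placeholders):

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

urls = ["http://examplewebsite.com/page1",  # placeholder URLs
        "http://examplewebsite.com/page2"]

def fetch_title(url):
    # Each worker thread downloads and parses one page
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else url

# Threads overlap the network waits, which dominate scraping time
with ThreadPoolExecutor(max_workers=5) as executor:
    for title in executor.map(fetch_title, urls):
        print(title)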




1 Answer

With BeautifulSoup, that's all doable and simple.

(1) Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.

Say all the links you need have price in their text - you can use the text argument:

soup.find_all("a", text="price")  # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text)  # 'price' is inside the text

You may use functions and many other kinds of objects to filter elements - for example, compiled regular expressions:

import re

soup.find_all("a", text=re.compile(r"^[pP]rice"))

If price is somewhere in the "href" attribute, you can use one of the following CSS selectors:

soup.select("a[href*=price]")  # href contains 'price'
soup.select("a[href^=price]")  # href starts with 'price'
soup.select("a[href$=price]")  # href ends with 'price'

or, via find_all():

soup.find_all("a", href=lambda href: href and "price" in href)

(2) The href links I want are all in a certain location within the webpage, inside a particular element (e.g. a div with a certain class). Can I access only these links?

Sure, locate the appropriate container and call find_all() or other searching methods:

container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])

Or, you may write a CSS selector that searches for links inside a specific element with the desired attribute or attribute values. For example, here we are searching for a elements having an href attribute, located inside a div element having the container class:

soup.select("div.container a[href]")

(3) How can I scrape the contents behind each href link and save them to a file?

If I understand correctly, you need to get the appropriate links, follow them, and save the source code of the pages locally as HTML files. There are multiple options depending on your requirements (for instance, speed may be critical, or it may be a one-time task where you don't care about performance).

If you stay with requests, the code will be blocking in nature - you'll extract a link, follow it, save the page source, and then proceed to the next one. The main downside is that it will be slow (depending, for starters, on how many links there are). Sample code to get you going:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")

    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)

        # write the raw page bytes to an HTML file named after the link text
        # (note: this assumes the link text makes a valid filename)
        with open(title + ".html", "wb") as f:
            f.write(session.get(full_link).content)

If speed is critical, you may look into grequests or Scrapy to solve that part.
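
For instance, a minimal sketch of the grequests approach, assuming the links have already been collected (the URLs are placeholders):

import grequests

urls = ["http://examplewebsite.com/page1",  # placeholder URLs
        "http://examplewebsite.com/page2"]

# Build the requests lazily; grequests.map() sends them concurrently via gevent
pending = (grequests.get(u) for u in urls)
for response in grequests.map(pending, size=5):
    if response is not None:  # failed requests come back as None
        print(response.url, len(response.content))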

answered Oct 10 '22 by alecxe