Mock/Monkeypatch BeautifulSoup html objects for Pytest

I'm working on a web scraping project in Python and trying to add automated testing with Pytest. I'm not new to web scraping, but I'm very new to testing. I believe the idea is that I should mock the HTTP request and replace it with some dummy html fixture, so I can test whether the rest of the function works without having to request anything from the actual url.

Below is my web scraping function.

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

def get_player_stats_data():
    """
    Web Scrape function w/ BS4 that grabs aggregate season stats
    Args:
        None
    Returns:
        Pandas DataFrame of Player Aggregate Season stats
    """
    try:
        year_stats = 2022
        url = f"https://www.basketball-reference.com/leagues/NBA_{year_stats}_per_game.html"
        html = urlopen(url)
        soup = BeautifulSoup(html, "html.parser")

        headers = [th.getText() for th in soup.findAll("tr", limit=2)[0].findAll("th")]
        headers = headers[1:]

        rows = soup.findAll("tr")[1:]
        player_stats = [
            [td.getText() for td in rows[i].findAll("td")] for i in range(len(rows))
        ]

        stats = pd.DataFrame(player_stats, columns=headers)

        print(
            f"General Stats Extraction Function Successful, retrieving {len(stats)} updated rows"
        )
        return stats
    except Exception as error:
        # Catch Exception rather than BaseException so Ctrl+C still interrupts.
        print(f"General Stats Extraction Function Failed, {error}")
        df = []
        return df

And here is what I'm using to grab the raw html of the page and save it to a file, so I can load it back in for testing.

from urllib.request import urlopen

year_stats = 2022
url = f"https://www.basketball-reference.com/leagues/NBA_{year_stats}_per_game.html"
html = urlopen(url)

# how you save it
with open('new_test/tests/fixture_csvs/stats_html.html', 'wb') as fp:
    while True:
        chunk = html.read(1024)
        if not chunk:
            break
        fp.write(chunk)

# how you open it
with open('new_test/tests/fixture_csvs/stats_html.html', "rb") as fp:
    stats_html = fp.read()

My question is: how do I mock/patch/monkeypatch the urlopen(url) call and use the saved html in its place to create a fixture? The Pytest docs example creates a class and monkeypatches requests.get(), where get is an attribute of requests, which seems a little different from what I'm doing, and I haven't been able to get mine working. I think I'm supposed to use something other than monkeypatch.setattr? Below is what I tried.

@pytest.fixture(scope="session")
def player_stats_data_raw(monkeypatch):
    """
    Fixture to load web scrape html from an html file for testing.
    """
    fname = os.path.join(
        os.path.dirname(__file__), "fixture_csvs/stats_html.html"
    )

    with open(fname, "rb") as fp:
        html = fp.read()

    def mock_urlopen():
        return html

    monkeypatch.setattr(urlopen, "url", mock_urlopen)
    df = get_player_stats_data()
    return df

### The actual tests in a separate file
def test_raw_stats_rows(player_stats_data_raw):
    assert len(player_stats_data_raw) == 30

def test_raw_stats_schema(player_stats_data_raw):
    assert list(player_stats_data_raw.columns) == raw_stats_cols

The goal is to replace html = urlopen(url) in the web scraping function with the html I've previously saved.

The other option is to turn that url into an input parameter for the function, where in production I pass the actual url as you see here (www.basketball-reference.com/etc), and in testing I read in the saved html instead. That's an option, but I'm curious to learn and apply this patching technique to a real example. If anyone has any thoughts I'd appreciate it!
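
For comparison, the parameter-based option could be sketched roughly as below. This is only an illustration using the standard library: HeaderParser, parse_headers, get_headers, fake_fetch, and the fetch parameter are hypothetical names, and the stdlib HTMLParser stands in for BeautifulSoup so the sketch has no third-party dependencies.

```python
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class HeaderParser(HTMLParser):
    """Collects the text of every <th> cell (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


def parse_headers(html):
    """Pure parsing step: takes HTML text and never touches the network."""
    parser = HeaderParser()
    parser.feed(html)
    return parser.headers


def get_headers(url, fetch=urlopen):
    """Thin I/O wrapper; `fetch` defaults to urlopen and can be swapped in tests."""
    with fetch(url) as response:
        return parse_headers(response.read().decode("utf-8"))


def fake_fetch(url):
    """Test double: ignores the URL and serves canned HTML bytes."""
    return io.BytesIO(b"<table><tr><th>Player</th><th>Pos</th></tr></table>")
```

In a test, get_headers("https://example.invalid", fetch=fake_fetch) exercises the parsing code without any patching, while production code keeps calling get_headers(url) with the default urlopen.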

asked Mar 31 '26 22:03 by jyablonski

1 Answer

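In your test file, you could try something like this (it uses the mocker fixture from the pytest-mock plugin):

import os

import pytest

from module.script import get_player_stats_data

# Path to the saved html, built the same way as in the question.
fname = os.path.join(os.path.dirname(__file__), "fixture_csvs/stats_html.html")


# mocker is function-scoped, so this fixture must be too; requesting it
# from a session-scoped fixture raises a ScopeMismatch error.
@pytest.fixture
def urlopen(mocker):
    with open(fname, "rb") as fp:
        html = fp.read()
    # Patch the name where it is looked up (module.script), not urllib.request.
    urlopen = mocker.patch("module.script.urlopen")
    urlopen.return_value = html
    return urlopen


def test_raw_stats_rows(urlopen):
    df = get_player_stats_data()
    assert len(df) == 30


# raw_stats_cols is the expected column list from the question's test file.
def test_raw_stats_schema(urlopen):
    df = get_player_stats_data()
    assert list(df.columns) == raw_stats_cols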
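
The target string "module.script.urlopen" matters because from urllib.request import urlopen copies the name into the importing module, and that copy is what the function actually calls. Here is a minimal sketch of the idea using only the standard library's unittest.mock. The in-memory script module, fetch_html, and FAKE_HTML are hypothetical stand-ins for module/script.py, the scraping function, and the saved html file.

```python
import io
import types
from unittest import mock

# Hypothetical stand-in for module/script.py, built in memory so the
# sketch is self-contained. It binds `urlopen` into its own namespace.
script = types.ModuleType("script")
exec(
    "from urllib.request import urlopen\n"
    "\n"
    "def fetch_html(url):\n"
    "    return urlopen(url).read()\n",
    script.__dict__,
)

FAKE_HTML = b"<table><tr><th>Player</th></tr></table>"


def fetch_with_patched_module():
    """Patch the name where it is looked up: inside the module under test."""
    with mock.patch.object(
        script, "urlopen", return_value=io.BytesIO(FAKE_HTML)
    ) as fake:
        html = script.fetch_html("https://example.invalid/stats")
        # The fake received the URL, so no real request was made.
        fake.assert_called_once_with("https://example.invalid/stats")
    return html


def script_sees_patch_on_urllib():
    """Patching urllib.request.urlopen does not replace script.urlopen."""
    with mock.patch("urllib.request.urlopen") as fake:
        return script.urlopen is fake
```

The same target should also work with pytest's built-in fixture, for example monkeypatch.setattr("module.script.urlopen", lambda url: html), so the pytest-mock plugin is optional here.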
answered Apr 03 '26 13:04 by Laurent


