I am doing some web scraping using playwright-python>=1.41 and have to launch the browser in headed mode (e.g. launch(headless=False)).
For CI testing, I would like to cache the headed interactions with Chromium so that the tests can run offline.
How can this be done? I can't find any clear answers.
Recording the traffic to a HAR file might solve your problem.
Here is how to do that with playwright==1.41.1 and pytest-playwright==0.3.3:
import pathlib

import pytest
from playwright.sync_api import Browser, Playwright

CACHE_DIR = pathlib.Path(__file__).parent / "cache"


@pytest.fixture(name="example_har", scope="session")
def fixture_example_har(playwright: Playwright) -> pathlib.Path:
    har_file = CACHE_DIR / "example.har"
    with (
        playwright.chromium.launch(headless=False) as browser,
        browser.new_page() as page,
    ):
        # update=True records the live traffic into the HAR file instead of replaying it
        page.route_from_har(har_file, url="*/**", update=True)
        page.goto("https://example.com/")
    return har_file


def test_caching(browser: Browser, example_har: pathlib.Path) -> None:
    # offline=True blocks real network access, so responses must come from the HAR file
    with browser.new_context(offline=True) as context:
        page = context.new_page()
        page.route_from_har(example_har, url="*/**")
        page.goto("https://example.com/")
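Before relying on the recording in CI, it can help to check which URLs the HAR file actually contains. The sketch below assumes the standard HAR layout, where each request is stored under log.entries with its URL at request.url. The helper name recorded_urls is hypothetical and not part of Playwright.

```python
import json
import pathlib


def recorded_urls(har_file: pathlib.Path) -> list[str]:
    """Return the request URLs stored in a HAR file, in recorded order."""
    har = json.loads(har_file.read_text(encoding="utf-8"))
    # A HAR file keeps one entry per request under log.entries
    return [entry["request"]["url"] for entry in har["log"]["entries"]]
```

Calling recorded_urls(example_har) inside a test gives a quick way to assert that the page you intend to replay was recorded at all.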
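Another option is to use the browser context's set_extra_http_headers() and set_offline() methods, and to launch the browser with a specific disk cache directory so that the same cache is reused across invocations of your script. Note that set_extra_http_headers() adds headers to outgoing requests, so this approach depends on the browser's own HTTP cache rather than on a recording:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,
        chromium_sandbox=False,
        args=["--disk-cache-dir=/path/to/cache"],
    )
    # set_extra_http_headers() and set_offline() are BrowserContext methods
    context = browser.new_context()
    context.set_extra_http_headers({"Cache-Control": "max-age=31536000"})
    page = context.new_page()
    # Do web scraping here
    context.set_offline(True)  # from here on, no network access is allowed
    browser.close()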
The error playwright._impl._errors.Error: net::ERR_INTERNET_DISCONNECTED means that the browser is still trying to make network requests even though the context has been set to offline.
This can happen when the page still tries to load external resources, such as images or stylesheets, that are not in the cache.
Make sure that your web scraping code only interacts with cached content.
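To find out which resources are missing from the cache, one option is to record every request that fails while the context is offline. The sketch below is a plain-Python collector, assuming it is registered as a handler for Playwright's "requestfailed" page event, whose callback receives a request object exposing url and failure. The class name FailedRequestLog and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FailedRequestLog:
    """Collects (url, error text) pairs for requests that failed."""

    failures: list[tuple[str, str]] = field(default_factory=list)

    def record(self, request: Any) -> None:
        # Assumes the request object has `url` and `failure` (error text or None)
        self.failures.append((request.url, request.failure or "unknown"))

    def urls(self) -> list[str]:
        """Return only the URLs of the failed requests, in the order seen."""
        return [url for url, _ in self.failures]
```

Registering it with page.on("requestfailed", log.record) before calling page.goto() and then inspecting log.urls() should show which resources still need to be recorded or blocked.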
Also wait for in-flight network requests to finish before setting the context to offline. set_offline() itself only accepts the offline flag, so wait for the page to reach the "networkidle" load state first:
page.wait_for_load_state("networkidle")
page.context.set_offline(True)