
Importing a table from a webpage as a dataframe in Python

I am trying to read in a specific table from the US Customs and Border Protection's Dashboard on Southwest Land Border Encounters as a dataframe.

The URL is: https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters. I am particularly interested in the "month" and "U.S. Border Patrol / Total" columns from the final table on the page.

[Image: the table I am trying to import, with the columns of interest highlighted]

In past web scraping projects I've used the read_html function from the pandas package. But that doesn't work here. This code:

import pandas as pd

pd.read_html('https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters')

generates the error: HTTPError: HTTP Error 403: Forbidden.

Is there a way to programmatically get this data?

asked May 24 '26 14:05 by Ari


1 Answer

Our taxes already paid for this, so why the government would care about the User-Agent (UA) header makes little sense to me. But specifying one will get you past the 403:

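from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get(url: str) -> StringIO:
    # A browser-like User-Agent gets past the 403 Forbidden response.
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    # Wrap the HTML in a file-like object, which both BeautifulSoup
    # and pd.read_html() accept.
    return StringIO(resp.text)


url = "https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters"


def main() -> None:
    soup = BeautifulSoup(get(url), "lxml")
    print(soup.prettify())
    # Each get() call fetches the page again and returns a fresh buffer.
    pd.read_html(get(url))


if __name__ == "__main__":
    main()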

Now .read_html() fails with a different error, ValueError: No tables found. That is accurate: grepping the page source for <table shows no hits.
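The same check that grep performs can be done in Python before handing a page to read_html. This is only a minimal sketch using the standard library; TagCounter, count_tags and the two sample HTML strings are made-up names and data for illustration, not part of the CBP page or of the code above.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def count_tags(html: str, tag: str) -> int:
    """Return the number of <tag> start tags found in html."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts[tag.lower()]


# Hypothetical pages: one with a real <table>, one that only embeds an iframe.
sample_static = "<html><body><table><tr><td>1</td></tr></table></body></html>"
sample_embedded = (
    '<html><body><iframe title="Dashboard" '
    'src="https://example.com/viz"></iframe></body></html>'
)

print(count_tags(sample_static, "table"))
print(count_tags(sample_embedded, "table"))
print(count_tags(sample_embedded, "iframe"))
```

A count of zero for "table" combined with a nonzero count for "iframe" suggests the data is rendered inside an embedded document, which read_html will not follow.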

Let's examine those embedded iframes.

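from urllib.parse import unquote

def main() -> None:
    soup = BeautifulSoup(get(url), "lxml")
    # print(soup.prettify())

    # List each titled iframe along with its percent-decoded source URL.
    for frame in soup.find_all("iframe"):
        if "title" in frame.attrs:
            print(f"\n\n\n{frame['title']}")
            # .get() avoids a KeyError if an iframe has no src attribute.
            print(unquote(frame.get("src", "")))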

Yup, sure enough, this iframe is essentially a JS spreadsheet. Or maybe a pair of sheets. Alas, it's unclear how to easily get at the underlying .CSV data.
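For further digging, it may help to collect the iframe sources into a list rather than printing them. Below is a rough sketch that does this with the standard library only (so it swaps in html.parser for BeautifulSoup); IframeCollector, list_iframes, the example.com base URL and the sample HTML are all hypothetical and do not reflect the actual markup of the CBP page.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin


class IframeCollector(HTMLParser):
    """Collect (title, readable source URL) pairs for every iframe with a src."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # nothing to follow
        # Resolve relative links against the page URL, then percent-decode.
        # The decoded form is for reading; request the undecoded URL instead.
        readable = unquote(urljoin(self.base_url, src))
        self.frames.append((attr_map.get("title") or "", readable))


def list_iframes(html: str, base_url: str):
    """Return a list of (title, readable source URL) tuples found in html."""
    collector = IframeCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.frames


# Hypothetical markup standing in for the real page.
sample_html = (
    '<div><iframe title="Encounters" src="/viz/embed?name=SW%20Border"></iframe>'
    "<iframe></iframe></div>"
)

for title, src in list_iframes(sample_html, "https://example.com/page"):
    print(title, src)
```

Opening one of the collected URLs in a browser with the network inspector active is one way to look for whatever data requests the embedded dashboard makes, though whether that exposes a usable CSV here is not something this sketch establishes.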

answered May 27 '26 03:05 by J_H


