Importing a table from a webpage as a dataframe in Python

Question

I am trying to read in a specific table from the US Customs and Border Protection's Dashboard on Southwest Land Border Encounters as a dataframe.

The URL is: https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters. I am particularly interested in the "month" and "U.S. Border Patrol / Total" columns from the final table on the page.

[Image: the table I am trying to import, with columns of interest highlighted]

In past web scraping projects I've used the read_html function from the pandas package. But that doesn't work here. This code:

import pandas as pd

pd.read_html('https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters')

generates the error: HTTPError: HTTP Error 403: Forbidden.

Is there a way to programmatically get this data?

J_H · Accepted Answer

Our taxes already paid for this, so why the gummint would care about the User-Agent (UA) header makes little sense to me. But specifying one will get you past the 403:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get(url: str) -> StringIO:
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    return StringIO(resp.text)


url = "https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters"


def main() -> None:
    soup = BeautifulSoup(get(url), "lxml")
    print(soup.prettify())
    pd.read_html(get(url))


if __name__ == "__main__":
    main()

The new diagnostic becomes .read_html() complaining about ValueError: No tables found. Which is true; grep'ing for <table shows no hits.
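The same check can be done without leaving Python. Here is a minimal sketch that uses only the standard library's html.parser; the TagCounter class, the count_tags helper, and the sample HTML string are illustrations made up for this example, not the actual CBP page:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Made-up stand-in for a page that embeds its data in an iframe
sample = (
    "<html><body><p>Stats</p>"
    '<iframe title="Dashboard" src="https://example.com/embed"></iframe>'
    "</body></html>"
)
counts = count_tags(sample)
print(counts["table"], counts["iframe"])
```

On a real response you would pass resp.text instead of the sample string. Zero table tags alongside one or more iframe tags suggests the data lives in an embedded document rather than in the page itself.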

Let's examine those embedded iframes.

from urllib.parse import unquote

def main() -> None:
    soup = BeautifulSoup(get(url), "lxml")
    # print(soup.prettify())

    for frame in soup.find_all("iframe"):
        if "title" in frame.attrs:
            print(f"\n\n\n{frame['title']}")
            print(unquote(frame["src"]))

Yup, sure enough, this iframe is essentially a JS spreadsheet. Or maybe a pair of sheets. Alas, it's unclear how to easily get at the underlying .CSV data.
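One way to keep digging, offered only as a sketch: split each iframe src into its host, path, and query parameters to see which service is hosting the embed. The describe_src helper and the example URL below are hypothetical; the real value would come from frame["src"], and nothing here is confirmed to lead to a CSV download.

```python
from urllib.parse import parse_qs, urlparse


def describe_src(src: str) -> dict:
    """Split an iframe src into its host, path, and decoded query parameters."""
    parts = urlparse(src)
    return {
        "host": parts.netloc,
        "path": parts.path,
        # parse_qs percent-decodes values and groups repeated keys into lists
        "params": parse_qs(parts.query),
    }


# Hypothetical src value, not taken from the CBP page
example_src = "https://dashboards.example.com/embed/report?view=Monthly%20Totals&embed=y"
info = describe_src(example_src)
print(info["host"])
print(info["path"])
print(info["params"])
```

Knowing the host narrows down which dashboard product is in play, which in turn tells you whether it documents any export endpoint.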

Tags: python, pandas, web-scraping

Ari
