I am trying to read in a specific table from the US Customs and Border Protection's Dashboard on Southwest Land Border Encounters as a dataframe.
The url is: https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters. I am particularly interested in the "month" and "U.S. Border Patrol / Total" columns from the final table on the page:

In past web scraping projects I've used the read_html function from the pandas package. But that doesn't work here. This code:
import pandas as pd
pd.read_html('https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters')
generates the error: HTTPError: HTTP Error 403: Forbidden.
Is there a way to programmatically get this data?
Our taxes already paid for this data, so why the government
would care about the User-Agent header makes little sense to me.
But specifying one will get you past the 403:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get(url: str) -> StringIO:
    # A browser-like User-Agent avoids the 403 Forbidden response.
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    return StringIO(resp.text)


url = "https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters"


def main() -> None:
    soup = BeautifulSoup(get(url), "lxml")
    print(soup.prettify())
    pd.read_html(get(url))


if __name__ == "__main__":
    main()
Now .read_html() fails with a different error,
ValueError: No tables found.
Which is true:
grepping the HTML for <table shows no hits.
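The same check can be done in Python instead of grep. Here is a minimal sketch using only the standard library's html.parser rather than BeautifulSoup; TagCounter, count_tags, and the sample HTML are made up for illustration and are not taken from the CBP page:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags by name while parsing an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for the downloaded page: iframes but no tables.
sample = (
    "<html><body><div>"
    '<iframe title="A" src="x"></iframe>'
    '<iframe title="B" src="y"></iframe>'
    "</div></body></html>"
)

counts = count_tags(sample)
print("tables:", counts["table"], "iframes:", counts["iframe"])
```

On the real page you would pass resp.text to count_tags. Zero tables alongside a nonzero iframe count would explain why read_html has nothing to parse.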
Let's examine those embedded iframes.
from urllib.parse import unquote


def main() -> None:
    soup = BeautifulSoup(get(url), "lxml")
    # print(soup.prettify())
    for frame in soup.find_all("iframe"):
        if "title" in frame.attrs:
            print(f"\n\n\n{frame['title']}")
            print(unquote(frame["src"]))
Yup, sure enough, this iframe is essentially a JS spreadsheet. Or maybe a pair of sheets. Alas, it's unclear how to easily get at the underlying .CSV data.
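One possible next step, offered only as a sketch, is to split each iframe src into host, path, and query parameters to see which service hosts the dashboard and what it is being asked to load. The describe_src helper and the example URL below are hypothetical; the URL is not copied from the CBP page:

```python
from urllib.parse import parse_qs, unquote, urlsplit


def describe_src(src: str) -> dict:
    """Break an iframe src URL into host, decoded path, and query parameters."""
    parts = urlsplit(src)
    return {
        "host": parts.netloc,
        "path": unquote(parts.path),
        # parse_qs percent-decodes names and values, and maps each name to a list.
        "params": parse_qs(parts.query),
    }


# Hypothetical src value, for illustration only.
example_src = (
    "https://dashboards.example.com/views/Report%20One/Sheet1"
    "?embed=true&sheet=Total%20Encounters"
)

info = describe_src(example_src)
print(info["host"])
print(info["path"])
print(info["params"])
```

This does not retrieve any data by itself. It only makes the embedded URL easier to read, which may help in deciding whether the hosting service offers a documented export.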