
pandas read_html ValueError: No tables found

I am trying to scrape historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:

import pandas as pd 

page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)

Running it produces the following error:

Traceback (most recent call last):
 File "weather_station_scrapping.py", line 11, in <module>
  result = pd.read_html(page_link)
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
 File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
  raise_with_traceback(retained)
 File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
  raise exc.with_traceback(traceback)
ValueError: No tables found

The page clearly has a table, but read_html is not picking it up. I have also tried using Selenium so that the page can load before I read it:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
elem = driver.find_element_by_id("history_table")

head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')

list_rows = []

for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):

Now the problem is that it cannot find "tr". I would appreciate any suggestions.

Noman Bashir asked Jan 01 '23 14:01

1 Answer

Here's a solution using Selenium for browser automation:

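from io import StringIO
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()  # assumes chromedriver is discoverable (on PATH or via Selenium Manager)
driver.implicitly_wait(30)  # let element lookups wait up to 30 s for the page to render

driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(StringIO(driver.find_element(By.ID, "history_table").get_attribute('outerHTML')))[0]
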
Time    Temperature Dew Point   Humidity    Wind    Speed   Gust    Pressure  Precip. Rate. Precip. Accum.  UV  Solar
0   12:02 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m²
1   12:07 AM    25.5 °C 19 °C   76 %    East    0 kph   0 kph   29.31 hPa   0 mm    0 mm    0   0 w/m²
2   12:12 AM    25.5 °C 19 °C   76 %    East    0 kph   0 kph   29.31 hPa   0 mm    0 mm    0   0 w/m²
3   12:17 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m²
4   12:22 AM    25.5 °C 18.7 °C 75 %    East    0 kph   0 kph   29.3 hPa    0 mm    0 mm    0   0 w/m²

Edit: here is a breakdown of exactly what's happening, since the one-liner above is not very self-documenting:

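After setting up the driver, we select the table by its ID (thankfully, this site uses reasonable and descriptive IDs):

driver.find_element(By.ID, "history_table")

Then, from that element, we get the HTML rather than the WebDriver element object:

.get_attribute('outerHTML')

We use pandas to parse the HTML (wrapped in StringIO, which newer pandas versions expect for literal HTML):

pd.read_html(StringIO(...))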

From the docs:

"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"
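A small offline sketch of that behaviour, assuming a made-up two-row table rather than the real station page (the values are placeholders, and an HTML parser such as lxml is assumed to be installed):

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page's table; not real station data.
html = """
<table id="history_table">
  <thead>
    <tr><th>Time</th><th>Humidity</th></tr>
  </thead>
  <tbody>
    <tr><td>12:02 AM</td><td>75 %</td></tr>
    <tr><td>12:07 AM</td><td>76 %</td></tr>
  </tbody>
</table>
"""

# read_html hands back a list of DataFrames, one per <table> it finds.
tables = pd.read_html(StringIO(html))
print(type(tables).__name__, len(tables))

# With a single table in the input, the DataFrame sits at index zero.
df = tables[0]
print(df)
```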

So we index into that list to get the only table we have, at index zero.
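The same steps can also be written out as separate, commented calls. The following is only a sketch under some assumptions: the function names `table_html_to_frame` and `scrape_table` are made up for illustration, it uses Selenium 4 locator syntax, and it swaps the implicit wait for an explicit wait on the first body row, which may or may not suit how the page actually renders.

```python
from io import StringIO

import pandas as pd


def table_html_to_frame(table_html: str) -> pd.DataFrame:
    """Parse the outerHTML of a single <table> element into a DataFrame."""
    # read_html returns a list of DataFrames; one <table> in, one frame out.
    return pd.read_html(StringIO(table_html))[0]


def scrape_table(url: str, table_id: str, timeout: float = 30.0) -> pd.DataFrame:
    """Render `url` in Chrome, wait for the table rows, and return a DataFrame."""
    # Imported lazily so the parsing helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        driver.get(url)
        # Block until the table has at least one body row in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, f"#{table_id} tbody tr")
            )
        )
        # Step 1: select the table element by its ID.
        table = driver.find_element(By.ID, table_id)
        # Step 2: take its HTML; step 3: let pandas parse it.
        return table_html_to_frame(table.get_attribute("outerHTML"))
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

With the URL from the question stored in page_link, the call would be df = scrape_table(page_link, "history_table"). Waiting on a row rather than on the table itself is a design choice aimed at the "cannot find tr" symptom in the question: an empty table shell may exist before its rows do.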

G. Anderson answered Jan 05 '23 16:01