
Get web page content (Not from source code) [duplicate]

I want to get the rainfall data of each day from here.

When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.

I am using urllib2 and BeautifulSoup from bs4.

Here is my code:

import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"

r = urllib2.urlopen(link)
soup = BeautifulSoup(r)
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one:
# print soup.find_all("div", class_="dataTable")

Both return an empty list.

My question is: How can I get the rendered page content, rather than just the raw page source?

asked Mar 05 '26 by VICTOR


2 Answers

If you open the dev tools in Chrome/Firefox and look at the network requests, you'll see that the data is loaded by a separate request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml. That response contains the data for all 12 months, which you can then parse directly, with no HTML scraping needed.
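As a sketch of that second step, the daily values could be pulled out with the standard-library XML parser. Note this uses Python 3 syntax (unlike the urllib2 code in the question), and the tag names below are assumptions for illustration; the real schema of dailyExtract_2015.xml would need to be checked in the dev tools first:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the shape of dailyExtract_2015.xml;
# the actual tag/attribute names must be checked in the network tab.
sample = """
<year>
  <month code="201501">
    <day><date>1</date><rainfall>0.5</rainfall></day>
    <day><date>2</date><rainfall>Trace</rainfall></day>
  </month>
</year>
"""

root = ET.fromstring(sample)
rows = []
for month in root.iter("month"):
    for day in month.iter("day"):
        # findtext returns the child element's text, or None if it is missing
        rows.append((day.findtext("date"), day.findtext("rainfall")))

print(rows)
```

The same loop would work on the downloaded feed once the real element names are substituted in.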

answered Mar 06 '26 by Asish M.


If you cannot find the div in the source, it means the div is generated at runtime, for example by a JS framework like Angular or just by jQuery. To browse the rendered HTML, you have to use a browser that actually runs the included JavaScript.

Try using Selenium:

How can I parse a website using Selenium and Beautifulsoup in python?

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")

print soup.find_all("td", class_="td1_normal_class")

However, note that using Selenium considerably slows down the process, since it has to launch a real browser and wait for the page's JavaScript to run.

answered Mar 06 '26 by Simone Zandara
