
html get around the noscript tag

I'm using the Python library requests to download some web pages and parse them afterwards, e.g. to get the page title. However, requests doesn't seem to download the source correctly when a page contains a <noscript> tag.

For example, when I fetch https://www.coursera.org/course/startup, the source I get from requests differs from what I see when visiting the page in Chrome. What requests returns is the same as the output of Chrome's "view source" option.

So is there a way to "fool" the <noscript> tag? Or do I need to use something other than requests?

Gnijuohz asked Apr 24 '26 20:04


1 Answer

You note that the source requests gets matches Chrome's "view source" output. View source shows the real HTML source of the URL, which is exactly what requests fetches, so what you're seeing is what you should expect to see.

Your problem has nothing to do with the noscript tag. The content of the page is changed by JavaScript after loading, and requests does not execute JavaScript.
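To make the point concrete, here is a minimal sketch using only the standard library's html.parser. The HTML string is invented for illustration (it is not Coursera's real markup), and the class name TitleAndNoscript is hypothetical. It shows that a static parse of the raw source sees the placeholder title and the <noscript> text, but never the title that the script would set in a browser.

```python
from html.parser import HTMLParser

# Made-up page: the real title is only set by JavaScript after loading.
RAW_HTML = """<html><head><title>Loading...</title></head>
<body>
<noscript>Please enable JavaScript.</noscript>
<div id="app"></div>
<script>document.title = "Startup Engineering";</script>
</body></html>"""


class TitleAndNoscript(HTMLParser):
    """Collect the text inside <title> and <noscript> from raw HTML."""

    def __init__(self):
        super().__init__()
        self._current = None  # tag whose text we are currently inside
        self.title = ""
        self.noscript = ""

    def handle_starttag(self, tag, attrs):
        self._current = tag

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "noscript":
            self.noscript += data
        # Script text is seen as plain data and ignored: nothing is executed.


parser = TitleAndNoscript()
parser.feed(RAW_HTML)
print(parser.title)     # Loading...
print(parser.noscript)  # Please enable JavaScript.
```

The parser reports "Loading..." because that is what the server sent; the <noscript> block is just ordinary markup and is not what hides the rendered content.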

As @alecxe pointed out, you need to look deeper into how the Coursera site is built, e.g. by watching the XHR requests in the 'Network' tab of Chrome Developer Tools to find the URLs that the actual content is loaded from. You may then be able to load those URLs directly with requests.
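A rough sketch of that approach follows. Everything site-specific here is an assumption: API_URL is a placeholder for whatever URL you find in the Network tab, the {"elements": [{"name": ...}]} payload shape is invented, and the X-Requested-With header is only a common convention that some sites check, not something Coursera is known to require.

```python
# Placeholder: replace with the URL you actually observe in the Network tab.
API_URL = "https://example.com/api/courses?q=startup"


def fetch_json(url, timeout=10):
    """GET a data URL directly and decode its JSON body."""
    import requests  # imported lazily so course_names() works without requests installed

    headers = {"Accept": "application/json", "X-Requested-With": "XMLHttpRequest"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json()


def course_names(payload):
    """Pull names out of a payload shaped like {"elements": [{"name": ...}]} (assumed shape)."""
    return [item["name"] for item in payload.get("elements", []) if "name" in item]


# Offline demonstration with a hand-written payload in the assumed shape.
sample = {"elements": [{"name": "Startup Engineering"}, {"id": 7}]}
print(course_names(sample))  # ['Startup Engineering']
```

Against a real endpoint you would call course_names(fetch_json(API_URL)) and adjust the parsing to match the JSON you actually receive.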

Alternatively, this tutorial shows how to render a JavaScript-driven web page from Python:
https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

They provide example code that looks like this:

# Assumes Python 2 with PyQt4: str(result.toAscii()) relies on toHtml() returning a QString.
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from lxml import html

# Take this class for granted; just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _loadFinished quits the event loop

    def _loadFinished(self, result):
        # Keep a reference to the loaded frame, then stop the event loop.
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important: convert the QString to ASCII so lxml can process it.
archive_links = html.fromstring(str(result.toAscii()))
print(archive_links)

Anentropic answered Apr 27 '26 10:04


