HTML: getting around the noscript tag

Question

I'm using the Python library requests to download some web pages and then parse them, e.g. to get the page title. However, requests doesn't seem to download the source correctly when a page contains a <noscript> tag.

For example, when I fetch https://www.coursera.org/course/startup, the source I get from requests differs from what I see when visiting the page in Chrome. What requests returns is the same as the output of Chrome's "View source" option.

So is there a way to "fool" the <noscript> tag? Or do I need to use something other than requests?

Anentropic · Accepted Answer

You note that what requests returns matches Chrome's "View source" output. View source shows the actual HTML the server returns for that URL, which is exactly what requests receives. So what you're seeing is what you should expect to see.

Your problem has nothing to do with the noscript tag; the page's content is changed by JavaScript after it loads.
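
To make that concrete, here is a minimal sketch using only the standard library. The HTML string is made up for illustration (it is not Coursera's real markup), and the names StaticSourceInspector and inspect_static_source are hypothetical. It shows that a parser working on the raw source sees the placeholder title and the noscript fallback as ordinary markup, while the script that would change the title in a browser is never run.

```python
from html.parser import HTMLParser

# Made-up page: a browser running the script would end up showing a
# different title than the one present in the raw source.
RAW_SOURCE = """<html><head><title>Loading...</title></head>
<body>
<noscript><p>Please enable JavaScript.</p></noscript>
<div id="app"></div>
<script>document.title = "Startup Engineering";</script>
</body></html>"""


class StaticSourceInspector(HTMLParser):
    """Collect the <title> text and any <noscript> text from raw HTML."""

    def __init__(self):
        super().__init__()
        self._inside = None      # "title", "noscript", or None
        self.title = ""
        self.noscript_text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "noscript"):
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if self._inside == "title":
            self.title += data
        elif self._inside == "noscript" and data.strip():
            self.noscript_text.append(data.strip())


def inspect_static_source(html_text):
    """Return (title, noscript_text) as found in the unrendered source."""
    inspector = StaticSourceInspector()
    inspector.feed(html_text)
    inspector.close()
    return inspector.title, inspector.noscript_text


title, noscript_text = inspect_static_source(RAW_SOURCE)
print(title)
print(noscript_text)
```

Nothing here executes the script element, which is the same situation requests is in: it only ever sees the text the server sent.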

As @alecxe pointed out, you need to look more closely at how the Coursera site is built, e.g. by watching the XHR requests in the 'Network' tab of Chrome Developer Tools to find the URLs that the content you're looking for is actually loaded from. You may then be able to load those URLs directly with requests.
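
A rough sketch of that second step, assuming you have already found a JSON endpoint in the Network tab. The helper name fetch_json and the example.invalid URL are placeholders, not Coursera's real API; the session argument is expected to be a requests.Session (or anything with a compatible get method).

```python
def fetch_json(session, api_url, params=None):
    """Fetch a JSON endpoint discovered in the browser's Network tab.

    `session` is expected to behave like requests.Session.
    """
    response = session.get(
        api_url,
        params=params,
        # Some endpoints only answer with JSON when asked for it.
        headers={"Accept": "application/json"},
        timeout=10,
    )
    # Fail loudly on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.json()


# Usage (placeholder URL, replace with the one seen in the Network tab):
#
#   import requests
#   data = fetch_json(requests.Session(),
#                     "https://example.invalid/api/courses",
#                     params={"q": "startup"})
```

Passing the session in keeps cookies and headers shared between calls, which some sites require before their XHR endpoints respond.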

Alternatively, this tutorial shows how to render a JavaScript-driven web page from Python:
https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

It provides example code that looks like this:

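# Written for Python 2 with PyQt4, as in the tutorial.
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
from lxml import html


class Render(QWebPage):
    """Load a URL in a headless WebKit page and block until it has finished."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the Qt event loop until _loadFinished() quits it.
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so the caller can read the rendered HTML from it.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://pycoders.com/archive/'
r = Render(url)
# HTML after the page's JavaScript has run (a QString under PyQt4 on Python 2).
result = r.frame.toHtml()
# This step is important: convert the QString to ASCII so lxml can process it.
archive_links = html.fromstring(str(result.toAscii()))
print(archive_links)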

Tags:

python

html

noscript

Gnijuohz
