I would really appreciate it if someone could help me with a problem. I am trying to scrape https://www.marketwatch.com/investing/index/xxx, where xxx is the stock symbol, for example https://www.marketwatch.com/investing/index/spx. My code worked for more than a year, but it has stopped working: requesting the page now returns only a strange fragment of HTML. As you can see, the real webpage is far more complex than what my request returns. I also tried BeautifulSoup and similar tools, since I thought the problem was related to JavaScript, but I get the same result.
Part of code (with requests):
import requests
url = "https://www.marketwatch.com/investing/index/spx"
page = requests.get(url)
# page.content is the raw response body as bytes
print(page.content)
Result:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="about:blank" rel="shortcut icon"/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/json3/3.3.2/json3.min.js"> </script>
<script src="https://resources.kasadapolyform.io/kpfp.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=46f828d0-bb88-fcd0-c7ad-47f18d3c13a2"></script>
</head>
<body>
</body>
</html>
I would really appreciate the help.
As mentioned by Jaxi, the HTML returned implies that the page is almost entirely rendered by JavaScript rather than delivered as static HTML: the response contains only script tags and an empty body.
To work around this, you will need a tool that runs the JavaScript and then gives you the resulting HTML.
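A quick way to confirm this from code is to check whether the response has any visible text in its body at all. The following is only a rough sketch using the standard library; the names BodyTextCollector and looks_like_js_shell are made up for illustration, and "no visible body text" is just a heuristic, not a reliable test for every site.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collects visible text inside <body>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that a reader would actually see.
        if self.in_body and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_js_shell(html: str) -> bool:
    """Return True when the document has no visible text in its body."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return not parser.chunks
```

With the code from the question, this could be called as looks_like_js_shell(page.text); page.text is the decoded string, whereas page.content is bytes.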
One example is Selenium, which is used in UI testing.
Another is requests_html, a package by Kenneth Reitz (the original author of the requests package). It uses the Chromium browser under the hood to render the page for you. From the README:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'
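For the Selenium route, a minimal sketch might look like the one below. It assumes Selenium 4 and a local Chrome installation; build_url and fetch_rendered_html are hypothetical helper names, and the URL pattern is the one from the question. There is no guarantee the site will serve the full page to an automated browser, and content that loads late may need an explicit wait before reading page_source.

```python
def build_url(symbol: str) -> str:
    """Build the index page URL for a symbol, using the pattern from the question."""
    return f"https://www.marketwatch.com/investing/index/{symbol}"


def fetch_rendered_html(url: str) -> str:
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so build_url can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome version
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM at this moment, after JavaScript has run.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


# Example call (starts a real browser, so it is left commented out):
# html = fetch_rendered_html(build_url("spx"))
```

The returned string can then be handed to BeautifulSoup or any other parser in place of page.content.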
As a side note, as mentioned by ewindes, you should always be careful and make sure that the sites you are scraping permit web scraping. If not as a matter of legality, then as one of courtesy.
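One place to start is the site's robots.txt, which Python's standard library can parse. This is only a sketch: is_allowed is a made-up helper, the rules below are invented for illustration and are not the real robots.txt of any site, and robots.txt is just one signal, so the site's terms of service still need to be read separately.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the rules in an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which download the file for you.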