
Scraping websites with Javascript enabled?

I'm trying to scrape and submit information to websites that rely heavily on JavaScript for most of their functionality. The websites won't even work when I disable JavaScript in my browser.

I've searched for solutions on Google and SO, and someone suggested I should reverse engineer the JavaScript, but I have no idea how to do that.

So far I've been using Mechanize, and it works on websites that don't require JavaScript.

Is there any way to access websites that use JavaScript with urllib2 or something similar? I'm also willing to learn JavaScript, if that's what it takes.

asked Jul 29 '10 by user216171

People also ask

Is web scraping possible with JavaScript?

Yes. Gathering data from different sources for analysis can easily be automated with web scraping in JavaScript, and the collected data can be used for testing and training machine learning models.

Can BeautifulSoup scrape JavaScript?

Beautiful Soup is a very powerful library that makes web scraping easier by traversing the DOM (Document Object Model), but it only does static scraping. Static scraping ignores JavaScript: it fetches web pages from the server without the help of a browser.

Which is better for web scraping JavaScript or Python?

Python is your best bet. Libraries such as Requests or HTTPX make it very easy to scrape websites that don't require JavaScript to work correctly. Python offers a lot of simple-to-use HTTP clients, and once you have the response, it's also very easy to parse the HTML, for example with BeautifulSoup.
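As a minimal sketch of that static approach, here the fetch step is skipped and BeautifulSoup parses an HTML string directly (the sample markup and the `#articles` selector are made up for the example; in practice the string would come from a Requests or HTTPX response body):

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for a response body fetched with Requests/HTTPX.
html = """
<html><body>
  <ul id="articles">
    <li><a href="/post/1">First post</a></li>
    <li><a href="/post/2">Second post</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect (link text, href) pairs from the list.
links = [(a.get_text(), a["href"]) for a in soup.select("#articles a")]
print(links)  # [('First post', '/post/1'), ('Second post', '/post/2')]
```

This works only when the markup is already present in the server's response; for JavaScript-rendered pages, see the answers below.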

Can a website block you from web scraping?

If you send repetitive requests from the same IP, the website owners can detect your footprint in their server log files and may block your web scraper. To avoid this, you can use rotating proxies: a rotating proxy server allocates a new IP address from a pool of proxies for each request.
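A minimal sketch of the rotation itself, assuming a hypothetical pool of proxy URLs (the addresses below are placeholders; a real pool would come from a proxy provider):

```python
from itertools import cycle

# Hypothetical proxy pool -- replace with real proxy URLs from a provider.
PROXY_POOL = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

_proxies = cycle(PROXY_POOL)

def next_proxy():
    """Return proxy settings for the next request, cycling through the pool."""
    p = next(_proxies)
    return {"http": p, "https": p}
```

Each call returns the next address in the pool, so consecutive requests go out through different IPs; with Requests you would pass the result as the `proxies=` argument.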


2 Answers

I wrote a small tutorial on this subject which might help:

http://koaning.io.s3-website.eu-west-2.amazonaws.com/dynamic-scraping-with-python.html

Basically, you have the Selenium library drive a real Firefox browser; the browser waits for the page to load (so the JavaScript has run) before passing you the final HTML string. Once you have this string, you can parse it with BeautifulSoup.
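A sketch of that pattern (the function names and the `h2` selector are illustrative; running the fetch step assumes Selenium and a matching Firefox driver are installed):

```python
from bs4 import BeautifulSoup

def fetch_rendered_html(url):
    """Load a page in a real Firefox instance so its JavaScript executes,
    then return the resulting HTML. Requires selenium + a Firefox driver."""
    from selenium import webdriver  # imported here so the parser below works without it
    driver = webdriver.Firefox()
    try:
        driver.get(url)              # blocks until the page has loaded
        return driver.page_source    # HTML after JavaScript has run
    finally:
        driver.quit()

def extract_headlines(html):
    """Parse the rendered HTML with BeautifulSoup; h2 tags as an example target."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all("h2")]
```

`extract_headlines(fetch_rendered_html("https://example.com"))` would then return the headlines from the fully rendered page rather than from the bare server response.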

answered Oct 04 '22 by cantdutchthis


I've had exactly the same problem. It is not simple at all, but I finally found a great solution using PyQt4.QtWebKit.

You will find the explanations on this webpage: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/

I've tested it, I currently use it, and it's great!

Its great advantage is that it can run on a server with only an X display (for example a virtual one), without a full graphical environment.

answered Oct 04 '22 by Guillaume Lebourgeois