I am working on a project in which I need to crawl several websites and gather different kinds of information from them, such as text, links, and images.
I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I am stuck on sites that contain a lot of JavaScript, because most of the information on those pages is stored inside <script> tags.
Any ideas how to do this?
First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will render everything for you just like a regular browser would.
The only difference is that its main interface is an API rather than a GUI.
For example, you can use PhantomJS, or Chrome and Firefox, which both support a headless mode.
For a more complete list of headless browsers, check here.
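To make this concrete, here is a minimal sketch that drives headless Chrome through Selenium and then hands the rendered HTML to BeautifulSoup; the URL is just a placeholder:

```python
# Minimal sketch: render a JavaScript-heavy page with headless Chrome via
# Selenium, then parse the rendered HTML with BeautifulSoup as usual.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")       # placeholder URL
    html = driver.page_source               # HTML *after* the JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title)
print([a.get("href") for a in soup.find_all("a")])
```

The key point is that `page_source` gives you the DOM after the scripts have executed, so the same BeautifulSoup code you already have keeps working.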
If a lot of dynamic JavaScript loading is involved in rendering the page, things get more complicated.
Basically, you have 3 ways to crawl the data from the website (a rough sketch of the last two follows this list):
1. Drive a real (possibly headless) browser, e.g. via Selenium as shown above, so the JavaScript actually executes and you scrape the rendered DOM.
2. Open your browser's developer tools, find the AJAX/XHR requests the page makes, and call those endpoints directly; they often return clean JSON.
3. Fetch the raw HTML and extract the data embedded in the <script> tags yourself, e.g. with a regular expression plus json.loads.
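Here is a rough sketch of options 2 and 3. Everything specific in it is a hypothetical placeholder: the endpoint, its parameters, and the `window.__DATA__` variable name are things you would discover yourself in your browser's developer tools.

```python
import json
import re

import requests
from bs4 import BeautifulSoup

# Option 2: call the page's AJAX endpoint directly. The endpoint and its
# parameters below are hypothetical -- find the real ones in the Network
# tab of your browser's developer tools (filter by XHR/fetch).
resp = requests.get("https://example.com/api/items", params={"page": 1}, timeout=10)
resp.raise_for_status()
print(resp.json())  # many such endpoints return JSON, ready to use

# Option 3: pull data straight out of a <script> tag. Many pages embed a
# JSON blob assigned to a JS variable; "window.__DATA__" is a made-up name.
html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script"):
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", script.string or "", re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # only works if the blob is valid JSON
        print(data)
        break
```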
Also take a look at the Scrapy web-scraping framework. It doesn't handle AJAX calls either, but it is really the best tool in the web-scraping world I've worked with.
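A minimal Scrapy spider looks something like this (the spider name and start URL are placeholders):

```python
# Minimal Scrapy spider sketch; name and start URL are placeholders.
# Run it with:  scrapy runspider example_spider.py -o items.json
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # CSS selectors: grab the page title and every link target.
        yield {
            "title": response.css("title::text").get(),
            "links": response.css("a::attr(href)").getall(),
        }
```

Scrapy gives you request scheduling, throttling, retries, and item pipelines for free, which matters once you are crawling several websites rather than a single page.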
Hope that helps.