
Scraping HTML and JavaScript

I am working on a project in which I need to crawl several websites and gather different kinds of information from them, such as text, links, and images.

I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I am stuck when parsing sites that contain a lot of JavaScript, as most of the information on these pages is stored in <script> tags.

Any ideas how to do this?

user1934948 asked Mar 31 '14 14:03



2 Answers

First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client, which will parse everything for you just like a regular browser would.
The only difference is that its main interface is an API rather than a GUI/HMI.

For example, you can use PhantomJS, or Chrome or Firefox, both of which support headless mode.

For a more complete list of headless browsers check here.
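
As a minimal sketch of the headless approach, assuming Selenium 4 and a local Chrome install (neither is specified in this answer), the page can be loaded in headless Chrome and the resulting HTML handed to a parser such as BeautifulSoup. The helper names `make_headless_chrome` and `fetch_rendered_html` and the URL are hypothetical, and the `--headless=new` flag is an assumption that holds for recent Chrome versions only.

```python
def make_headless_chrome():
    """Start Chrome without a visible window (assumes Selenium 4 and Chrome)."""
    # Imported here so the rest of the module loads even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # older Chrome versions: "--headless"
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url):
    """Load the URL in the browser and return the page's current HTML."""
    driver.get(url)
    # Content that scripts add after the initial load may need an explicit
    # wait before it shows up here.
    return driver.page_source


# Usage (hypothetical URL):
# driver = make_headless_chrome()
# try:
#     html = fetch_rendered_html(driver, "https://example.com/")
# finally:
#     driver.quit()
```

The driver is passed in rather than created inside `fetch_rendered_html`, so one browser instance can be reused across many pages.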

bosnjak answered Sep 23 '22 05:09



If a lot of the page content is loaded dynamically by JavaScript, things get more complicated.

Basically, you have three ways to crawl the data from the website:

  • Use the browser developer tools to see which AJAX requests are made on page load, then simulate those requests in your crawler. You will probably need the json and requests modules.
  • Use a tool that drives a real browser, such as Selenium. In this case you don't care how the page is loaded; you get what a real user sees. Note: you can use a headless browser too.
  • See if the website provides an API (e.g. the Walmart API).
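
The first option could look roughly like the sketch below. It assumes the network tab showed a JSON endpoint; the endpoint URL, the `items` and `name` keys, and the helper names `extract_names` and `fetch_names` are all hypothetical placeholders for whatever the real site returns.

```python
import json


def extract_names(payload):
    """Pull the `name` field out of each item in a JSON response body."""
    data = json.loads(payload)
    # "items" and "name" are hypothetical keys; inspect the real response
    # in the browser's network tab to find the actual structure.
    return [item["name"] for item in data.get("items", [])]


def fetch_names(session, url):
    """Replay an AJAX request using a requests.Session-like object."""
    response = session.get(
        url,
        # Some sites only answer AJAX endpoints when this header is present.
        headers={"X-Requested-With": "XMLHttpRequest"},
        timeout=10,
    )
    response.raise_for_status()
    return extract_names(response.text)


# Usage (hypothetical endpoint):
# import requests
# names = fetch_names(requests.Session(), "https://example.com/api/items?page=1")
```

Using a session keeps cookies between calls, which helps when the endpoint expects the ones set by the main page.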

Also take a look at the Scrapy web-scraping framework. It doesn't handle AJAX calls either, but it is the best web-scraping tool I've ever worked with.

Also see these resources:

  • Web-scraping JavaScript page with Python
  • Scraping javascript-generated data using Python
  • web scraping dynamic content with python
  • How to use Selenium with Python?
  • Headless Selenium Testing with Python and PhantomJS
  • selenium with scrapy for dynamic page

Hope that helps.

alecxe answered Sep 26 '22 05:09
