I am working on a project in which I need to crawl several websites and gather different kinds of information from them, such as text, links, and images.
I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I am stuck on sites that contain a lot of JavaScript, because most of the information on those pages is stored inside <script> tags.
Any ideas how to do this?
First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will render everything for you just like a regular browser would.
The only difference is that its main interface is an API rather than a GUI.
For example, you can use PhantomJS, or Chrome and Firefox, which both support a headless mode.
For a more complete list of headless browsers, check here.
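To make this concrete, here is a minimal sketch that drives headless Chrome through Selenium and then hands the rendered HTML to BeautifulSoup; the URL is just a placeholder:

```python
# Minimal sketch: render a JavaScript-heavy page with headless Chrome via
# Selenium, then parse the rendered HTML with BeautifulSoup as usual.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")       # placeholder URL
    html = driver.page_source               # HTML *after* the JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title)
print([a.get("href") for a in soup.find_all("a")])
```

The key point is that `page_source` gives you the DOM after the scripts have executed, so the same BeautifulSoup code you already have keeps working.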
If a lot of dynamic JavaScript loading is involved in rendering the page, things get more complicated.
Basically, you have 3 ways to crawl the data from the website (a rough sketch of the last two follows this list):
1. Drive a real (possibly headless) browser, e.g. via Selenium as shown above, so the JavaScript actually executes and you scrape the rendered DOM.
2. Open your browser's developer tools, find the AJAX/XHR requests the page makes, and call those endpoints directly; they often return clean JSON.
3. Fetch the raw HTML and extract the data embedded in the <script> tags yourself, e.g. with a regular expression plus json.loads.
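Here is a rough sketch of options 2 and 3. Everything specific in it is a hypothetical placeholder: the endpoint, its parameters, and the `window.__DATA__` variable name are things you would discover yourself in your browser's developer tools.

```python
import json
import re

import requests
from bs4 import BeautifulSoup

# Option 2: call the page's AJAX endpoint directly. The endpoint and its
# parameters below are hypothetical -- find the real ones in the Network
# tab of your browser's developer tools (filter by XHR/fetch).
resp = requests.get("https://example.com/api/items", params={"page": 1}, timeout=10)
resp.raise_for_status()
print(resp.json())  # many such endpoints return JSON, ready to use

# Option 3: pull data straight out of a <script> tag. Many pages embed a
# JSON blob assigned to a JS variable; "window.__DATA__" is a made-up name.
html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script"):
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", script.string or "", re.DOTALL)
    if match:
        data = json.loads(match.group(1))  # only works if the blob is valid JSON
        print(data)
        break
```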
Also take a look at the Scrapy web-scraping framework. It doesn't handle AJAX calls either, but it is really the best tool in the web-scraping world I've worked with.
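A minimal Scrapy spider looks something like this (the spider name and start URL are placeholders):

```python
# Minimal Scrapy spider sketch; name and start URL are placeholders.
# Run it with:  scrapy runspider example_spider.py -o items.json
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # CSS selectors: grab the page title and every link target.
        yield {
            "title": response.css("title::text").get(),
            "links": response.css("a::attr(href)").getall(),
        }
```

Scrapy gives you request scheduling, throttling, retries, and item pipelines for free, which matters once you are crawling several websites rather than a single page.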
Hope that helps.