So I'm using python and beautifulsoup4(which i'm not tied to) to scrape a website. Problem is when I use urlib to grab the html of a page it's not the entire page because some of it is generated via the javascript. Is there any way to get around this?
There are two approaches to scraping a dynamic webpage: Scrape the content directly from the JavaScript. Scrape the website as we view it in our browser — using Python packages capable of executing the JavaScript.
It is ideally not possible because BeautifulSoup is just an HTML parser. So in those scenarios it is better to use Selenium to pull dynamic content.
Whether it's a web or mobile application, JavaScript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to efficiently scrape the web to meet most of your requirements.
There are basically two main options to proceed with:
The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
The second option is better in terms of you get what any other real user gets and you wouldn't be worried about how the page was loaded. Selenium is pretty powerful in locating elements on a page - you may not need BeautifulSoup
at all. But, anyway, this option is slower than the first one.
Hope that helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With