Web scraping a website with dynamic javascript content

Tags: python, javascript, html-parsing, beautifulsoup, web-scraping

So I'm using Python and BeautifulSoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, I don't get the entire page, because some of it is generated by JavaScript. Is there any way to get around this?

709

asked Mar 28 '14 14:03

Igglyboo

1 Answer

There are basically two options:

1. Use the browser's developer tools to see which AJAX requests the page makes to load its content, then simulate those requests in your script. You will probably need the json module to load the JSON response string into a Python data structure.
2. Use a tool like Selenium that drives a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS.

The first option is harder to implement and, generally speaking, more fragile, since it breaks whenever the site changes its internal requests. On the other hand, it doesn't require a real browser and can be faster.
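To illustrate the first option, here is a minimal sketch using only the standard library. The endpoint URL, the request headers, and the payload shape (`{"items": [{"title": ...}]}`) are all made up for the example; the real ones have to be read off the Network tab of the browser's developer tools for the site in question.

```python
import json
import urllib.request

# Hypothetical endpoint: the real one is whatever request shows up in the
# browser's Network tab (XHR/Fetch filter) while the page loads.
API_URL = "https://example.com/api/items?page=1"


def fetch_json_text(url, timeout=10):
    """Request the URL the page's JavaScript calls and return the raw body."""
    request = urllib.request.Request(
        url,
        headers={
            # Assumed headers: some endpoints check them, so copy whatever
            # the browser actually sends.
            "User-Agent": "Mozilla/5.0",
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def extract_titles(json_text):
    """Load the JSON string into Python objects and pull out the titles.

    Assumes the made-up payload shape {"items": [{"title": ...}, ...]}.
    """
    data = json.loads(json_text)
    return [item["title"] for item in data.get("items", [])]


# Offline demonstration with a sample payload in the assumed shape.
sample = '{"items": [{"title": "First"}, {"title": "Second"}]}'
print(extract_titles(sample))
```

Against a real site, you would call `extract_titles(fetch_json_text(API_URL))` with the actual endpoint and adjust the parsing to whatever structure the response really has.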

The second option is better in that you get exactly what a real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page, so you may not need BeautifulSoup at all. However, this option is slower than the first one.
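A rough sketch of the second option might look like the following. It assumes Selenium 4 and a local Chrome installation (rather than PhantomJS); the function names, the URL, and the CSS selector are placeholders for illustration, not part of any library.

```python
def texts_after_render(driver, url, css_selector, timeout=10):
    """Load url in an existing WebDriver, wait until JavaScript has produced
    elements matching css_selector, and return their text."""
    # Imported lazily so this sketch can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Block until at least one matching element is in the DOM; a
    # TimeoutException is raised if none appears within `timeout` seconds.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
    )
    return [el.text for el in driver.find_elements(By.CSS_SELECTOR, css_selector)]


def scrape(url, css_selector):
    """Start a headless Chrome, scrape one page, and always close the browser."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        return texts_after_render(driver, url, css_selector)
    finally:
        driver.quit()
```

A call such as `scrape("https://example.com", "h1")` would then return the text of the matching elements as rendered in the browser. If you still prefer BeautifulSoup for parsing, `driver.page_source` gives you the rendered HTML to pass to it.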

Hope that helps.

146

answered Oct 30 '22 17:10

alecxe
