I wrote a Scrapy spider that processes JavaScript content on web pages with the help of Selenium. However, I realized that this spider is significantly slower than a regular Scrapy crawler.
For this reason I want to combine two spiders: the common CrawlSpider for fetching all resources, and a Selenium spider just for pages that use JavaScript extensively. I created
a pipeline step that tries to detect whether a web page requires JavaScript and uses it heavily. So far,
my ideas for this processing step have failed:
- The <noscript> tag.
- An alert element such as <div class="yt-alert-message" >.
There are so many diverse ways to indicate that a page requires JavaScript!
Note: I only want to process pages with my Selenium spider where it is really necessary, since that spider is significantly slower and some pages use JavaScript only for cosmetic design.
You can get all the JavaScript from the script tags, add it all up, and check whether its length exceeds whatever amount you think constitutes "massive" JavaScript.
from selenium.webdriver.common.by import By

# assumes `driver` is a Selenium WebDriver that has already loaded the page
# get all script tags
scripts = driver.find_elements(By.TAG_NAME, "script")
# create a string to add all the JS content to
javaScriptChars = ""
# create a list to store URLs for external scripts
urls = []
# for each script on the page...
for script in scripts:
    # get the src
    url = script.get_attribute("src")
    # if the script is external (has a non-empty 'src' attribute)...
    if url:
        # add the URL to the list (will access it later)
        urls.append(url)
    else:
        # the script is inline - so just get the text inside
        javaScriptChars += script.get_attribute("textContent") or ""
# for each external URL found above...
for url in urls:
    # open the script (this navigates the browser away from the original page)
    driver.get(url)
    # add the content to our string (the browser may wrap it in some HTML)
    javaScriptChars += driver.page_source
# check if the string is longer than some threshold you choose
if len(javaScriptChars) > 50000:
    # JS contains more than 50000 characters
    print("page uses a lot of JavaScript")
The number is arbitrary. I guess even 50000 characters of JS might not actually be "a lot", because the page might not call every function every time. That will likely depend somewhat on what the user does.
But if you can assume that a well-designed site only includes the scripts it needs, then the number of characters could still be a relevant indicator of how much JS it runs.
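Since the goal is to avoid Selenium wherever possible, the same counting idea could also be applied to the raw HTML that the normal Scrapy spider has already downloaded, before any browser is started. What follows is only a rough sketch under that assumption: the names ScriptCounter and looks_js_heavy, and the max_external_scripts limit, are made up for illustration, and both thresholds are as arbitrary as the 50000 above. It uses only Python's standard html.parser module and counts external scripts instead of downloading them.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Collects inline script size and external script URLs from static HTML."""

    def __init__(self):
        super().__init__()
        self.inline_chars = 0        # total characters of inline JavaScript
        self.external_urls = []      # values of <script src="..."> attributes
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # external script: remember the URL, do not download it
            self.external_urls.append(src)
        else:
            # inline script: start counting the text that follows
            self._in_inline_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        if self._in_inline_script:
            self.inline_chars += len(data)


def looks_js_heavy(html, max_inline_chars=50000, max_external_scripts=10):
    """Return True if the page exceeds either (arbitrary) threshold."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return (counter.inline_chars > max_inline_chars
            or len(counter.external_urls) > max_external_scripts)
```

In a Scrapy callback or pipeline step this could be called as looks_js_heavy(response.text), and only pages for which it returns True would be handed to the Selenium spider. It will not catch every JavaScript-dependent page, but it is cheap enough to run on everything.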