 

How can I detect if a page massively uses Javascript with Python, Scrapy and Selenium?

I wrote a Scrapy spider that processes JavaScript content on webpages with the help of Selenium. However, I realized that this spider is significantly slower than a common Scrapy crawler. For this reason I want to combine two spiders: the common CrawlSpider for fetching all resources, and a Selenium spider only for pages that use JavaScript extensively. I created a pipeline step that tries to detect whether a webpage requires JavaScript and uses it heavily. So far my ideas for that detection step have failed:

  • Some pages use the common <noscript> tag.
  • Some pages print an alert message e.g. <div class="yt-alert-message" >.
  • ...

There are so many different ways for a page to indicate that it requires JavaScript!

  • Do you know a standardized way to detect pages that use JavaScript extensively?

Note: I only want to process pages with my Selenium spider where it is really necessary, as that spider is significantly slower and some pages use JavaScript only for visual polish.
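One way to turn the signals listed above into a pipeline check (a sketch only — the class and function names here are hypothetical, and the thresholds are guesses you would tune): parse the response body with the standard-library HTML parser and score it on simple signals such as the amount of inline script content, the number of script tags, and the presence of a <noscript> fallback.

```python
from html.parser import HTMLParser


class JSHeuristicParser(HTMLParser):
    """Collects simple signals that a page leans on JavaScript."""

    def __init__(self):
        super().__init__()
        self.script_chars = 0      # characters of inline <script> content
        self.script_count = 0      # number of <script> tags (inline or external)
        self.has_noscript = False  # page declares a <noscript> fallback
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
            self._in_script = True
        elif tag == "noscript":
            self.has_noscript = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chars += len(data)


def looks_js_heavy(html, char_threshold=50000, tag_threshold=10):
    """Hypothetical check: route a page to the Selenium spider
    when any of the simple signals passes its threshold."""
    parser = JSHeuristicParser()
    parser.feed(html)
    return (parser.script_chars > char_threshold
            or parser.script_count > tag_threshold
            or parser.has_noscript)
```

A function like this could run in a downloader middleware or pipeline step on the raw response body, flagging requests for re-crawling with the Selenium spider. It will of course miss pages that load all their JS externally, so the thresholds and signals are only a starting point.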

Jon asked Mar 19 '26 11:03

1 Answer

You can collect all the JavaScript from the script tags, concatenate it, and check whether the total length exceeds whatever amount you consider "massive" JavaScript.

# get all script tags
scripts = driver.find_elements_by_tag_name("script")

# create a string to collect all the inline JS content
javaScriptChars = ""

# create a list to store the urls of external scripts
urls = []

# for each script on the page...
for script in scripts:

    # get the src
    url = script.get_attribute("src")

    # if the script is external (has a non-empty 'src' attribute)...
    if url:

        # remember the url (we will fetch it later)
        urls.append(url)

    else:

        # the script is inline - so just take the text inside
        javaScriptChars += script.get_attribute("textContent")

# for each external url found above...
for url in urls:

    # open the script
    driver.get(url)

    # add its content to our string
    javaScriptChars += driver.page_source

# check if the string is longer than some threshold you choose
if len(javaScriptChars) > 50000:
    # the page carries more than 50000 characters of JS
    ...

The number is arbitrary. Even 50,000 characters of JS might not actually be "a lot", because the page might not call every function every time; how much actually runs will likely depend on what the user does.

But if you can assume a well-designed site only includes the scripts it needs, then the character count can still be a reasonable indicator of how much JS the page relies on.
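One refinement worth noting: fetching each external script with driver.get spins up the full browser rendering pipeline, which is heavier than needed just to measure length. A plain HTTP request would do. A sketch, assuming the urls list and inline javaScriptChars string built above (the helper name and the injectable fetch parameter are mine, added so the logic can be tested without a network):

```python
from urllib.request import urlopen


def total_js_chars(inline_chars, urls, fetch=None):
    """Sum the inline JS length plus the length of each external script.

    `fetch` maps a url to the script body; it defaults to a plain
    HTTP GET, and is injectable so the function is testable offline.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=10).read().decode("utf-8", "replace")
    total = len(inline_chars)
    for url in urls:
        try:
            total += len(fetch(url))
        except OSError:
            # unreachable script: skip it rather than abort the whole check
            pass
    return total
```

You would then compare total_js_chars(javaScriptChars, urls) against your chosen threshold instead of driving the browser through every script url.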

Dingredient answered Mar 21 '26 23:03