
Getting AJAX content loaded when scrolled with Selenium WebDriver

I am using Selenium WebDriver to get the content of a site. (Note: the site has no API. Wish it did.) The site uses AJAX to dynamically load content as the user scrolls. To get that content, I've been using JavaScript to scroll down and then accessing it with findElements().

To be clear about the setup, the page contains several nested elements, one of which is a div with the "GridItems" class (no name or id). This div contains many child elements with the "Item" class (again, no name or id, just the class). I want to get every element with class "Item" in the div. About 25 items are accessible when the page first loads (not necessarily visible in the current window, but available in the DOM), and scrolling down loads more.

My main issues are as follows: first, I want to stop scrolling when I get to the bottom, but I can't figure out what stopping condition to use. How can I determine when I've reached the bottom of the page? window.scrollHeight won't work, because that gives the height of the page as it currently is, not what it will be after more content finishes loading. I've thought of testing whether an element at the bottom of the page is visible/clickable, but if it's not, that may just be because it hasn't loaded yet, not because it hasn't been reached. Even using a Wait may not work: if it times out, I don't know whether it's because I haven't reached the bottom or just because the page is taking a long time to load.
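For context on what such a stopping condition could look like: one common check is to compare the document height against the window height plus the current scroll offset; when they match, you're at the bottom. The comparison itself is plain arithmetic, so it can be factored into a small helper (the three values would come from a JavascriptExecutor call in real code; the class and method names here are illustrative):

```java
public class ScrollCheck {
    // True when the viewport has reached the end of the document.
    // docHeight: total document height (e.g. $(document).height())
    // winHeight: visible window height (e.g. $(window).height())
    // scrollTop: current scroll offset (e.g. $(window).scrollTop())
    static boolean atBottom(long docHeight, long winHeight, long scrollTop) {
        return winHeight + scrollTop >= docHeight;
    }

    public static void main(String[] args) {
        // Mid-page: a 600px window scrolled 100px into a 2000px document.
        System.out.println(atBottom(2000, 600, 100));  // false
        // Bottom: 600 + 1400 == 2000.
        System.out.println(atBottom(2000, 600, 1400)); // true
    }
}
```

The catch the question itself raises still applies: the document height grows as content loads, so the check only means "bottom" once no further scroll makes the height increase.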

The second problem is that when I scroll down, more elements load, but eventually scrolling loads new ones at the bottom and drops the top ones out of the DOM. This means I can't just scroll to the bottom and then use findElements() to get all the Items, because many of the first ones will be gone. I know how many items to expect, so currently I'm doing the following:

    int numitems = 135;
    List<WebElement> newitems;
    List<WebElement> allitems = new ArrayList<WebElement>(50);

    do {
        //scroll down the full length of the page three times
        for (int i = 0; i < 3; i++) {
            js.executeScript("window.scrollTo(0, document.body.offsetHeight)");
        }

        //if it runs too fast, it may get to the next line before it finishes
        //scrolling; make it wait until the desired div is present
        WebElement cont = (new WebDriverWait(driver, 100))
                .until(ExpectedConditions.presenceOfElementLocated(By.className("GridItems")));

        //get all Items in the div
        newitems = cont.findElements(By.className("Item"));

        //add all the items extracted after scrolling three times to the list
        allitems.addAll(newitems);

        //repeat until there are at least as many items in the list as are
        //expected to be found. This is hacky; I wish there were a better
        //stopping condition
    } while (numitems > allitems.size());

That is, I scroll the page three times, get all elements available after the scrolling, and add them to a list, repeating until the list holds at least as many elements as I was expecting to find.

The problem with this is that since scrolling adds a different number of items to the DOM each time, there is often overlap between what is added to the allitems list at each iteration. The WebElements are just objects with unique ids and contain no information about the actual HTML, so I can't check whether they're duplicates. I may also lose some items if the scrolling doesn't overlap perfectly. And since I've scrolled down, the earlier items in the list that have fallen off the top lose their connection to the DOM, so I get a StaleElementReferenceException when I try to process them.

I suppose I could process each item as I get it, though that will make the code clunky. It would also let me check each item's actual content and find the duplicates. I'm still not sure it would ensure that I don't skip any.
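One way to make the process-as-you-go idea workable is to extract a stable key from each item's content (an image URL, a title, anything unique) and collect the keys into a Set, so duplicates from overlapping scrolls are dropped automatically. A minimal sketch of the merging step, with the keys simulated as plain strings (in real code each key would come from something like item.getAttribute("src")):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupeItems {
    // Accumulates item keys across overlapping scroll batches.
    // LinkedHashSet keeps first-seen order and silently drops repeats.
    static Set<String> merge(List<List<String>> batches) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> batch : batches) {
            seen.addAll(batch);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Two scrolls whose results overlap on "b" and "c".
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "c", "d"));
        System.out.println(merge(batches)); // [a, b, c, d]
    }
}
```

This sidesteps both the duplicate problem and the StaleElementReferenceException: once the key is extracted as a String, the WebElement can go stale without consequence.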

Does anyone have any suggestions for how best to do this? Am I missing something very important/obvious here? The other questions here on SO about AJAX content loading address somewhat different problems (e.g. I generally don't have an issue with content failing to load and having to wait for it, though I did include a Wait). It seems that there should be a better way of doing this - is there?

Sorry for the long-winded post; I hope it was clear.

Thank you so much, bsg

Edit:

I realize that the accepted answer only answers part of the question. For the rest of it, I found that scrolling down one screen at a time and getting all new elements after each scroll meant that I didn't lose any. After each scroll, I got every element currently loaded and did some processing to save the content of each one. This introduces a lot of redundancy, which I eliminated with a HashSet. I stop scrolling when I reach the bottom, as determined by the code in the accepted answer. Hope this helps.
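The loop described in this edit (scroll one screen, collect everything currently in the DOM, dedupe into a set, stop at the bottom) can be sketched as follows. To keep the sketch runnable without a browser, the page is simulated as a list of DOM snapshots, one per scroll; the real browser calls are indicated in comments, and the `collect` method name and snapshot shape are purely illustrative:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ScrollLoop {
    // One snapshot = the items present in the DOM after one scroll.
    // The loop stops when the simulated page runs out of snapshots
    // (in real code: when the bottom-of-page height check succeeds).
    static Set<String> collect(List<List<String>> snapshots) {
        Set<String> collected = new LinkedHashSet<>();
        int screen = 0;
        boolean atBottom = false;
        while (!atBottom) {
            // Real code would scroll one screen here, e.g.:
            //   js.executeScript("window.scrollBy(0, window.innerHeight)");
            List<String> visible = snapshots.get(screen);
            collected.addAll(visible); // set semantics absorb the overlap
            screen++;
            // Real code: document height == window height + scrollTop
            atBottom = (screen == snapshots.size());
        }
        return collected;
    }

    public static void main(String[] args) {
        // Items fall off the top as new ones load; consecutive screens overlap.
        Set<String> all = collect(Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "c", "d", "e"),
                Arrays.asList("d", "e", "f")));
        System.out.println(all); // [a, b, c, d, e, f]
    }
}
```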

asked Nov 13 '22 by bsg


1 Answer

How can I determine when I've reached the bottom of the page?

Plain JS doesn't work well for that, so I used jQuery to determine it. Once you reach the bottom, this evaluates to true:

$(document).height() == ($(window).height() + $(window).scrollTop());

Is there anything that distinguishes their uniqueness? Your Flickr example contains images; the URL of each image could be used for this, by calling WebElement.getAttribute("src") to create a unique identifier.

answered Nov 14 '22 by Daniël W. Crompton