
Simulate infinite scrolling in C# to get the full HTML of a page

Lots of sites use this (IMO) annoying "infinite scrolling" style, for example Tumblr, Twitter, and 9GAG.

I recently tried to scrape some pictures off these sites programmatically with HtmlAgilityPack, like this:

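// Requires: using System.Linq; using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
// SelectNodes returns null (not an empty collection) when nothing matches.
var primary = doc.DocumentNode.SelectNodes("//img[@class='badge-item-img']");
// Take the src of the first matching image, or null if there is none.
var picstring = primary?.Select(r => r.GetAttributeValue("src", null)).FirstOrDefault();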

This works fine, but when I tried to load the HTML from certain sites, I noticed that I only got back a small amount of content (let's say the first 10 "posts" or "pictures"). This made me wonder whether it would be possible to simulate scrolling down to the bottom of the page in C#.

This isn't just the case when I load the HTML programmatically. When I simply go to a site like Tumblr and check Firebug or "view source", I expect all the content to be in there somewhere, but a lot of it seems to be hidden or inserted with JavaScript. Only the content that is actually visible on my screen is present in the HTML source.

So my question is: is it possible to simulate infinitely scrolling down a page and load the resulting HTML, preferably with C#?

(I know that I can use the APIs for Tumblr and Twitter, but I'm just trying to have some fun hacking stuff together with HtmlAgilityPack.)

Thousand asked Jul 24 '13 18:07





2 Answers

There is no way to reliably do this for all such websites in one shot, short of embedding a web browser (which typically won't work in headless environments).

What you should consider doing instead is looking at the site's JavaScript in order to see what AJAX queries are used to fetch content as the user scrolls down.

Alternatively, use a web debugger in your browser (such as the one included in Chrome). These debuggers usually have a "network" pane you can use to inspect AJAX requests performed by the page. Looking at these requests as you scroll down should give you enough information to write C# code that simulates those requests.
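As a rough illustration of simulating those requests, here is a minimal sketch. It assumes the network pane showed a paged endpoint of the form `?page=N`; that URL pattern, the `X-Requested-With` header, the "empty body means no more content" rule, and the names `ScrollSimulator`, `BuildPageUrl`, `FetchPagesAsync`, and `FetchFromSiteAsync` are all hypothetical and need to be adapted to what the real site actually sends.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

public static class ScrollSimulator
{
    // Hypothetical URL pattern; replace it with whatever the network pane shows.
    public static string BuildPageUrl(string endpoint, int page)
    {
        return endpoint + "?page=" + page;
    }

    // Requests pages 1..maxPages through the supplied fetch delegate and
    // returns the raw response bodies. Stops early on an empty body, which
    // this sketch assumes means "no more content".
    public static async Task<List<string>> FetchPagesAsync(
        Func<string, Task<string>> fetch, string endpoint, int maxPages)
    {
        var bodies = new List<string>();
        for (int page = 1; page <= maxPages; page++)
        {
            string body = await fetch(BuildPageUrl(endpoint, page));
            if (string.IsNullOrWhiteSpace(body))
            {
                break;
            }
            bodies.Add(body);
        }
        return bodies;
    }

    // Wires the loop above to a real HttpClient.
    public static async Task<List<string>> FetchFromSiteAsync(string endpoint, int maxPages)
    {
        using (var client = new HttpClient())
        {
            // Assumption: some sites only answer AJAX calls that carry this
            // header. Check the real request before relying on it.
            client.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");
            return await FetchPagesAsync(url => client.GetStringAsync(url), endpoint, maxPages);
        }
    }
}
```

Passing the fetch step in as a delegate keeps the paging loop separate from the HTTP details, so it can be tried against canned responses before pointing it at a live site.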

You will then have to parse the response from those requests as whatever type of content that particular API delivers, which will probably be JSON or XML, but almost certainly not HTML. (This may be better for you anyway, since it will save you having to parse out display-oriented HTML, whereas the AJAX API will give you data objects that should be much easier to use.)
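For the JSON case, a minimal sketch using `System.Text.Json` might look like the following. The response shape (a top-level `posts` array whose items carry an `image_url` string) is invented for illustration, as are the names `FeedParser` and `ExtractImageUrls`; the real property names have to come from the responses you see in the network pane.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class FeedParser
{
    // Pulls image URLs out of a response shaped like
    // {"posts":[{"image_url":"..."}, ...]} (a hypothetical shape).
    public static List<string> ExtractImageUrls(string json)
    {
        var urls = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            JsonElement posts;
            // TryGetProperty is only valid on JSON objects, so check the kind first.
            if (root.ValueKind != JsonValueKind.Object
                || !root.TryGetProperty("posts", out posts)
                || posts.ValueKind != JsonValueKind.Array)
            {
                return urls;
            }

            foreach (JsonElement post in posts.EnumerateArray())
            {
                JsonElement url;
                if (post.ValueKind == JsonValueKind.Object
                    && post.TryGetProperty("image_url", out url)
                    && url.ValueKind == JsonValueKind.String)
                {
                    var value = url.GetString();
                    if (value != null)
                    {
                        urls.Add(value);
                    }
                }
            }
        }
        return urls;
    }
}
```

Posts without an `image_url` are skipped rather than treated as errors, since feeds of this kind often mix item types.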

cdhowie answered Oct 16 '22 21:10



Those sites make asynchronous HTTP requests to load the subsequent page contents. Since HtmlAgilityPack doesn't have a JavaScript interpreter (thank heavens for that), you will need to make those requests yourself. Most sites will probably return JSON rather than HTML fragments, so you'll need a JSON parser for the responses, not HtmlAgilityPack.

recursive answered Oct 16 '22 20:10
