How to generate images of web pages in a high performance environment?

I am trying to generate images of web pages in under a second in a server-side environment. Requests can arrive in parallel from the web. To that end, I am using the Puppeteer-Sharp library, which works pretty well. On the back end, it's using Chromium to load the page and then screenshot it.

The problem is that it takes a while to get started. For instance, note the timings (from my PC) for the sample code from readme.md:

var options = new LaunchOptions { Headless = true, ExecutablePath = @"c:\foo\chrome.exe" };
var browser = await Puppeteer.LaunchAsync(options);            //  ~500ms
var page = await browser.NewPageAsync();                       //  ~215ms
var webPage = await page.GoToAsync("http://www.google.com");   //  ~500ms
await page.ScreenshotAsync(outputFile);                        //  ~300ms

As you can see, it easily goes over a second. I don't know how Chromium works internally, so I have a couple of questions pertaining to solutions that I am thinking of.

  1. Is the PuppeteerSharp.Browser object thread-safe and/or re-entrant? Can I use the same browser object from different threads? I am thinking not, because it's tied to a specific instance of Chromium in memory.
  2. If I cut .LaunchAsync and .NewPageAsync out of every request, that will significantly speed up the operation. Will a pool of PuppeteerSharp.Browser objects work? For instance, I can pre-allocate 5 of these and execute .NewPageAsync on them. Then the incoming requests would use the objects from the pool. Is that a viable approach?
AngryHacker asked Jan 26 '23 17:01


1 Answer

Although there are still many improvements going on, Puppeteer-Sharp is thread-safe. To improve loading performance, there are a few approaches you can take.

Launch one browser and then connect to it

You can launch one (real) browser and then use the ConnectAsync method to connect to it.

await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = false,
});

var theBrowser1 = await Puppeteer.ConnectAsync(new ConnectOptions { BrowserWSEndpoint = browser.WebSocketEndpoint });
var theBrowser2 = await Puppeteer.ConnectAsync(new ConnectOptions { BrowserWSEndpoint = browser.WebSocketEndpoint });
var page1 = await theBrowser1.NewPageAsync();
var page2 = await theBrowser2.NewPageAsync();

await Task.WhenAll(
    page1.GoToAsync("https://www.stackoverflow.com"),
    page2.GoToAsync("https://serverfault.com/")
);

I know that this code is not really running in parallel, but it gives you the idea of reusing the same browser.

Create new pages on the same browser

If you are using TPL, you shouldn't have any issues creating new pages from different threads using the same browser.

await new BrowserFetcher().DownloadAsync(BrowserFetcher.DefaultRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = false,
});

var urls = new string[]
{
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com",
    "https://www.stackoverflow.com"
};

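// Task.Run unwraps the async lambda, so WhenAll waits for every navigation.
// Task.Factory.StartNew would return a nested task that completes as soon as
// the lambda reaches its first await.
await Task.WhenAll(
    urls.Select(url => Task.Run(async () =>
    {
        var page = await browser.NewPageAsync();
        return await page.GoToAsync(url);
    })));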

Again, this example is just to give you an idea of how this could be accomplished.

Pages queue

One user created a queue of X pages up front (calling NewPageAsync X times) and then took pages from that queue as requests came in. You can see the example here.
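
As a rough illustration of the queue idea, here is a minimal sketch of a generic pool built on System.Threading.Channels. It is not the linked example and not part of PuppeteerSharp: the AsyncPool class and its CreateAsync and UseAsync methods are hypothetical names made up for this sketch. The pool creates all of its items up front, hands one out per request, and puts it back when the request is done.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical helper (not a PuppeteerSharp type): a fixed-size pool of
// pre-created items that callers borrow and then return.
public sealed class AsyncPool<T>
{
    private readonly Channel<T> _items = Channel.CreateUnbounded<T>();

    private AsyncPool() { }

    // Create every item up front so that no request pays the creation cost.
    public static async Task<AsyncPool<T>> CreateAsync(int size, Func<Task<T>> factory)
    {
        var pool = new AsyncPool<T>();
        for (var i = 0; i < size; i++)
        {
            await pool._items.Writer.WriteAsync(await factory());
        }
        return pool;
    }

    // Waits asynchronously until an item is free, runs the work on it, and
    // always puts the item back, even if the work throws.
    public async Task<TResult> UseAsync<TResult>(Func<T, Task<TResult>> work)
    {
        var item = await _items.Reader.ReadAsync();
        try
        {
            return await work(item);
        }
        finally
        {
            await _items.Writer.WriteAsync(item);
        }
    }
}

// Possible use with PuppeteerSharp (an assumption, adjust to your version;
// the page type is Page in older releases and IPage in newer ones):
//
//   var pool = await AsyncPool<IPage>.CreateAsync(5, () => browser.NewPageAsync());
//   byte[] image = await pool.UseAsync(async page =>
//   {
//       await page.GoToAsync(url);
//       return await page.ScreenshotDataAsync();
//   });
```

With a pool like this, the number of pages open at once is capped at the pool size, and extra requests wait for a free page instead of opening new ones. Because a page keeps its state between uses, you may want to reset it (for example by navigating to about:blank) before returning it to the pool.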

hardkoded answered Jan 29 '23 20:01