
How can I tell that the page has finished loading?

I'm playing with Chromium's headless web browser API. Based on chrome_remote_shell source code, I came up with the following code:

#!/usr/bin/env python

import json
import requests
import pprint
import websocket

tablist = json.loads(requests.get("http://%s:%s/json" % ("localhost", 9222)).text)
print(tablist)
wsurl = tablist[0]['webSocketDebuggerUrl']
conn = websocket.create_connection(wsurl)
navcom = json.dumps({"id":0, "method":"Network.enable"})
conn.send(navcom)
navcom = json.dumps({"id":1, "method":"Page.navigate", "params":{"url":"https://news.ycombinator.com/"}})
conn.send(navcom)

while True:
    packet = json.loads(conn.recv())
    if 'method' in packet:
        print(packet['method'])
    else:
        print(packet)

Here's example output:

[{u'description': u'', u'title': u'Hacker News', u'url': u'https://news.ycombinator.com/', u'webSocketDebuggerUrl': u'ws://localhost:9222/devtools/page/7d03a57d-77a9-4ceb-b645-3b85461de5be', u'type': u'page', u'id': u'7d03a57d-77a9-4ceb-b645-3b85461de5be', u'devtoolsFrontendUrl': u'/devtools/inspector.html?ws=localhost:9222/devtools/page/7d03a57d-77a9-4ceb-b645-3b85461de5be'}]
{u'id': 0, u'result': {}}
Network.requestWillBeSent
{u'id': 1, u'result': {u'frameId': u'21045.1'}}
Network.responseReceived
Network.dataReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.responseReceived
Network.dataReceived
Network.loadingFinished
Network.requestWillBeSent
Network.requestServedFromCache
Network.responseReceived
Network.dataReceived
Network.loadingFinished

I get a long stream of messages, the last of them being Network.loadingFinished, but that event fires once per requestId, so it appears many times. How can I modify my script so that it leaves the loop and terminates once the page has fully loaded?

d33tah asked Oct 15 '25 07:10


2 Answers

It turns out I should have also subscribed to page events via Page.enable:

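#!/usr/bin/env python

import json
import sys

import requests
import websocket

# Ask the browser's remote debugging endpoint for its list of open tabs.
tablist = json.loads(requests.get("http://%s:%s/json" % ("localhost", 9222)).text)
print(tablist)

# Attach to the first tab over its DevTools websocket.
wsurl = tablist[0]['webSocketDebuggerUrl']
conn = websocket.create_connection(wsurl)

# Subscribe to Network events (only needed for the printed trace).
navcom = json.dumps({"id":0, "method":"Network.enable"})
conn.send(navcom)
# Subscribe to Page events; without this, Page.loadEventFired never arrives.
navcom = json.dumps({"id":1, "method":"Page.enable"})
conn.send(navcom)
# Navigate to the URL given as the first command-line argument.
navcom = json.dumps({"id":2, "method":"Page.navigate", "params":{"url":sys.argv[1]}})
conn.send(navcom)

# Print every message until the page's load event fires, then stop.
while True:
    s = conn.recv()
    packet = json.loads(s)
    if packet.get('method') == 'Page.loadEventFired':
        break
    print(s)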

What we're doing here is enabling event notifications for both the Page and Network domains, then navigating to the website and reading every message that arrives afterwards. Once we receive Page.loadEventFired, we can assume that the page has finished loading, so we can exit the loop and carry out any actions that depend on this condition.
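One weakness of the loop above is that it spins forever if Page.loadEventFired never arrives. Below is a minimal sketch of the same wait with a deadline, assuming Python 3. The helper name wait_for_event and its recv parameter are hypothetical, not part of any library; recv is any zero-argument callable that returns one JSON message per call, such as conn.recv.

```python
import json
import time


def wait_for_event(recv, method, timeout=30.0):
    """Read messages via recv() until one with the given method arrives.

    recv is any zero-argument callable returning one JSON string per call
    (for example conn.recv from the script above). Returns the matching
    message as a dict, or raises TimeoutError once the deadline has passed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        packet = json.loads(recv())
        if packet.get("method") == method:
            return packet
    raise TimeoutError("no %s within %s seconds" % (method, timeout))
```

It would be called as wait_for_event(conn.recv, "Page.loadEventFired"). Note that the deadline is only checked between messages, so a recv() call that blocks indefinitely would still hang; setting a socket timeout on the connection as well is probably wise.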

d33tah answered Oct 17 '25 19:10


In any general sense, you can't... not really.

Given dynamic web pages these days, you need to understand what the page is actually doing and look for some specific event / existence of a DOM element, or other clue.

As you see, you're getting lots of loadingFinished events, but how do you know which one is the "last"? You need to understand the page. For example, can you determine how many requests will be sent by observing that the page makes one request per DOM element of a specific class, or from a JavaScript variable, or from an XHR response? If so, you can stop once you get n responses. Or is there something special about the last request (its target or payload) or the last response (e.g., zero length, or containing the text "last", ^D, or ^Z)?
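As a rough illustration of counting requests, here is a sketch that pairs each Network.requestWillBeSent with its matching Network.loadingFinished or Network.loadingFailed by requestId. The function name pending_requests is hypothetical, and it takes already-decoded messages (dicts) so it can be tried without a browser. An empty result only means nothing is in flight at that moment; it does not prove the page will not start another request later.

```python
def pending_requests(packets):
    """Return the set of requestIds that have started but not yet finished."""
    pending = set()
    for packet in packets:
        method = packet.get("method")
        request_id = packet.get("params", {}).get("requestId")
        if method == "Network.requestWillBeSent":
            # A set keeps repeated events for the same requestId from
            # being counted twice.
            pending.add(request_id)
        elif method in ("Network.loadingFinished", "Network.loadingFailed"):
            pending.discard(request_id)
    return pending
```

In the receive loop, you could append each decoded packet to a list and treat the page as idle once pending_requests returns an empty set and stays empty for some quiet period of your choosing.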

Also, if the page is polling the server (often with sockets), what does "finish loading" even mean?

Update for onload

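If what you're after is the equivalent of the onload event and you drive the browser through WebDriver rather than the raw debugging protocol, you don't have to do anything special: driver.get(<url>) blocks until then. From the WebDriver documentation: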

WebDriver will wait until the page has fully loaded (that is, the onload event has fired) before returning control to your test or script. It's worth noting that if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded. If you need to ensure such pages are fully loaded then you can use waits.
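With the raw debugging connection from the question, a comparable wait can be approximated by polling the page with Runtime.evaluate until an element you expect has appeared. This is only a sketch: wait_for_selector, its send/recv parameters, and the starting message id of 1000 are all hypothetical choices, and the selector you wait for depends entirely on the page. send and recv are callables such as conn.send and conn.recv.

```python
import json
import time


def wait_for_selector(send, recv, selector, attempts=50, delay=0.1, first_id=1000):
    """Poll until document.querySelector(selector) finds an element.

    Returns True as soon as the element exists, or False after `attempts`
    unsuccessful checks.
    """
    # json.dumps turns the selector into a quoted JavaScript string literal.
    expression = "document.querySelector(%s) !== null" % json.dumps(selector)
    for i in range(attempts):
        msg_id = first_id + i
        send(json.dumps({
            "id": msg_id,
            "method": "Runtime.evaluate",
            "params": {"expression": expression, "returnByValue": True},
        }))
        # Skip events and unrelated replies until the reply to this id arrives.
        while True:
            packet = json.loads(recv())
            if packet.get("id") == msg_id:
                break
        if packet.get("result", {}).get("result", {}).get("value") is True:
            return True
        time.sleep(delay)
    return False
```

For example, wait_for_selector(conn.send, conn.recv, "#content") would check for a (hypothetical) element with id "content" up to 50 times, roughly 0.1 seconds apart. Any events received while waiting are discarded here, so this fits best after the load event has already been handled.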

pbuck answered Oct 17 '25 19:10