
How to get webpage resource content via Chrome remote debugging

I want to get a webpage's resource content from Python via the Chrome Debugging Protocol. In the protocol documentation I noticed the method getResourceContent, which needs the params frameId and url. I think this method is what I need, so I did the following:

1. Start Chrome as a server: .\chrome.exe --remote-debugging-port=9222

2. Write the Python test code:

# coding=utf-8
"""
chrome --remote-debugging api test
"""

import json
import requests
import websocket

import pdb

def send():
    geturl = requests.get('http://localhost:9222/json')
    websocketURL = json.loads(geturl.content)[0]['webSocketDebuggerUrl']
    request = {}
    request['id'] = 1
    request['method'] = 'Page.navigate'
    request['params'] = {"url": 'http://global.bing.com'}
    ws = websocket.create_connection(websocketURL)
    ws.send(json.dumps(request))
    res = ws.recv()
    ws.close()
    print res

    frameId = json.loads(res)['result']['frameId']
    print frameId
    geturl = requests.get('http://localhost:9222/json')
    websocketURL = json.loads(geturl.content)[0]['webSocketDebuggerUrl']
    req = {}
    req['id'] = 1
    req['method'] = 'Page.getResourceContent'
    req['params'] = {"frameId":frameId,"url": 'http://global.bing.com'}
    header = ["User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"]
    pdb.set_trace()
    ws = websocket.create_connection(websocketURL,header=header)
    ws.send(json.dumps(req))
    ress = ws.recv()
    ws.close()
    print ress
if __name__ == '__main__':
    send()

3. Page.navigate works fine; I got something like this: {"id":1,"result":{"frameId":"8504.2"}}

4. When I try the method getResourceContent, an error comes out: {"error":{"code":-32000,"message":"Agent is not enabled."},"id":1}

I tried adding a User-Agent header, but it still does not work.

Thanks.

asked Aug 01 '16 07:08 by yang


1 Answer

The error message "Agent is not enabled" has nothing to do with the HTTP User-Agent header. It refers to an agent within Chrome that needs to be enabled in order to retrieve page contents.

The term "agent" is a bit misleading, since the protocol documentation speaks of domains that need to be enabled in order to debug them (I suppose the term "agent" refers to the way this is implemented in Chrome internally).

So the question is: which domain needs to be enabled in order to access the page contents? In hindsight it is quite obvious: the Page domain needs to be enabled, as we are calling a method in this domain. I only found this out after stumbling over this example, though.
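As a minimal sketch of what enabling a domain looks like on the wire, assuming the same JSON message shape as the Page.navigate request in the question (an id plus a method), the request can be built like this. The helper name enable_command is made up for illustration and is not part of any library:

```python
import json

def enable_command(domain, message_id):
    """Build the JSON text of a '<domain>.enable' request (hypothetical helper)."""
    # Enabling the Page domain needs no arguments, so "params" is omitted here.
    return json.dumps({"id": message_id, "method": domain + ".enable"})

# The text that would be passed to ws.send() to enable the Page domain:
print(enable_command("Page", 2))
```

The returned string is sent over the websocket in the same way as any other request.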

Once I added the Page.enable request to the script to activate the Page domain, the error message disappeared. However, I encountered two other problems:

  1. The websocket connection needs to be kept open between requests, as Chrome keeps some state between invocations (such as whether the agent is enabled).
  2. When navigating to http://global.bing.com/, the browser is redirected to http://www.bing.com/ (at least it is on my computer). This causes Page.getResourceContent to fail, because the requested resource http://global.bing.com/ is not available.
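Both points could also be handled more generally. The following is only a sketch under some assumptions, not a tested recipe. The helper call (a hypothetical name) sends a command over a connection that is already open and skips any other messages until the response with the matching id arrives; events sent by Chrome carry no id. The helper main_frame_url (also hypothetical) reads the URL the page actually ended up on from the response of Page.getResourceTree, which the protocol documentation lists as an experimental method, so the URL does not have to be hard-coded. Here ws can be any object with send and recv methods, such as the connection returned by websocket.create_connection:

```python
import json

def call(ws, message_id, method, params=None):
    """Send one command and return its response, skipping unrelated messages."""
    ws.send(json.dumps({"id": message_id, "method": method, "params": params or {}}))
    while True:
        message = json.loads(ws.recv())
        # Events have no "id"; only the response repeats the id we sent.
        if message.get("id") == message_id:
            return message

def main_frame_url(resource_tree_response):
    """Return the main frame's URL from a Page.getResourceTree response."""
    return resource_tree_response["result"]["frameTree"]["frame"]["url"]
```

With a real connection, one would call call(ws, 1, "Page.enable") first, then Page.navigate and Page.getResourceTree with different ids, and pass the result of main_frame_url as the url parameter of Page.getResourceContent. If the page is still loading when the tree is requested, the URL may not be the final one yet.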

After fixing these issues I was able to retrieve the page content. This is my code:

# coding=utf-8
"""
chrome --remote-debugging api test
"""

import json
import requests
import websocket

def send():
    # Set up the websocket connection to the first debuggable target.
    # The connection stays open for all requests below.
    geturl = requests.get('http://localhost:9222/json')
    websocketURL = geturl.json()[0]['webSocketDebuggerUrl']
    ws = websocket.create_connection(websocketURL)

    # Navigate to global.bing.com:
    request = {}
    request['id'] = 1
    request['method'] = 'Page.navigate'
    request['params'] = {"url": 'http://global.bing.com'}
    ws.send(json.dumps(request))
    result = ws.recv()
    print("Page.navigate: " + result)
    frameId = json.loads(result)['result']['frameId']

    # Enable the Page agent (each request gets its own id):
    request = {}
    request['id'] = 2
    request['method'] = 'Page.enable'
    request['params'] = {}
    ws.send(json.dumps(request))
    print("Page.enable: " + ws.recv())

    # Retrieve the resource contents, using the URL after the redirect:
    request = {}
    request['id'] = 3
    request['method'] = 'Page.getResourceContent'
    request['params'] = {"frameId": frameId, "url": 'http://www.bing.com'}
    ws.send(json.dumps(request))
    result = ws.recv()
    print("Page.getResourceContent: " + result)

    # Close the websocket connection
    ws.close()

if __name__ == '__main__':
    send()
answered Nov 04 '22 10:11 by Christoph Böhme