Currently I have Selenium hooked up to python to scrape a webpage. I found out that the page actually pulls data from a JSON API, and I can get a JSON response as long as I'm logged in to the page.
However, my approach of getting that response into Python seems a bit junky: I select the text enclosed in <pre> tags and use Python's json package to parse the data, like so:
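import json
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://jsonplaceholder.typicode.com/posts/1'
driver = webdriver.Chrome()
driver.get(url)
# Chrome wraps a raw JSON response in a <pre> element; grab its text.
# (The find_element_by_* helpers were removed in recent Selenium 4 releases.)
json_text = driver.find_element(By.CSS_SELECTOR, 'pre').get_attribute('innerText')
json_response = json.loads(json_text)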
The only reason I need to select within <pre> tags at all is that when JSON appears in Chrome, it comes formatted like this:
<html>
<head></head>
<body>
<pre style="word-wrap: break-word; white-space: pre-wrap;">{
"userId": 1,
"id": 1,
"title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
"body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
}</pre>
</body>
</html>
And the only reason I need to do this inside Selenium at all is that I need to be logged in to the website in order to get a response. Otherwise I get a 401 and no data.
You can find the pre element, get its text, and then load it via json.loads():
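import json
from selenium.webdriver.common.by import By

# `driver` is the already-open, logged-in WebDriver from the question
pre = driver.find_element(By.TAG_NAME, "pre").text
data = json.loads(pre)
print(data)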
If this does not work as-is, try prepending view-source: to your URL, as suggested by @Skandix in the comments.
Alternatively, you can avoid using selenium to fetch the JSON itself: transfer the cookies from selenium to requests so that the session stays logged in, see:
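A minimal sketch of that cookie hand-off, assuming the site authenticates purely through cookies and that the requests package is installed. The helper name `session_from_selenium_cookies` is hypothetical (it is not part of selenium or requests); it only relies on `driver.get_cookies()` returning a list of dicts with `name` and `value` keys:

```python
import requests


def session_from_selenium_cookies(selenium_cookies):
    """Build a requests.Session carrying cookies copied from a Selenium browser.

    `selenium_cookies` is the list of dicts returned by driver.get_cookies().
    (Hypothetical helper for illustration only.)
    """
    session = requests.Session()
    for cookie in selenium_cookies:
        # Copy each browser cookie into the session's cookie jar,
        # keeping its domain and path when the browser reported them.
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


# Usage (assumes `driver` is already logged in and `url` is the JSON endpoint):
# session = session_from_selenium_cookies(driver.get_cookies())
# response = session.get(url)
# response.raise_for_status()   # a 401 here means the cookies were not enough
# data = response.json()
```

If the server also checks request headers such as User-Agent, copying cookies alone may not be enough and the 401 may persist; in that case the <pre> approach above remains the fallback.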