I am downloading HTML pages that have data defined in them in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)
Thanks
Edit: Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?
Loading JSON in Python is straightforward. The built-in json module provides loads(), which parses a JSON string, and load(), which reads JSON from a file object.
The JavaScript counterpart is JSON.parse(), which converts JSON text into a JavaScript object: const obj = JSON.parse('{"name":"John", "age":30, "city":"New York"}');
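As a minimal sketch of the two json functions mentioned above, applied to the literal from the question (the variable names and the StringIO stand-in for a file are just for illustration):

```python
import io
import json

# The object literal from the question, as a plain string.
json_text = '{"activity":{"type":"read"}}'

# json.loads() parses a str (or bytes) that contains JSON.
data = json.loads(json_text)

# json.load() reads from a file-like object; StringIO stands in for a file here.
same_data = json.load(io.StringIO(json_text))

print(data["activity"]["type"])
```

Neither function helps with finding the literal inside the HTML; they only parse it once the text has been isolated.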
BeautifulSoup is an html parser; you also need a javascript parser here. btw, some javascript object literals are not valid json (though in your example the literal is also a valid json object).
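To illustrate the point about object literals that are not valid JSON, here is a small sketch. The second literal is a made-up variant of the one in the question (unquoted keys, single-quoted strings), and try_parse is a hypothetical helper:

```python
import json

valid_json = '{"activity": {"type": "read"}}'
# Legal as a javascript object literal, but not JSON:
# the keys are unquoted and the string uses single quotes.
js_only = "{activity: {type: 'read'}}"


def try_parse(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


parsed_valid = try_parse(valid_json)
parsed_js_only = try_parse(js_only)
print(parsed_valid, parsed_js_only)
```

So json.loads() is only enough when the page happens to emit strict JSON, as it does in the question.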
In simple cases you could:
1. extract <script>'s text using an html parser
2. assume that window.blog... is a single line or there is no ';' inside the object, and extract the javascript object literal using simple string manipulations or a regex

Example:
#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
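soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# string= replaces the deprecated text= argument (bs4 >= 4.4.0)
script = soup.find('script', string=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)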
data = json.loads(json_text)
assert data['activity']['type'] == 'read'
If the assumptions are incorrect then the code fails.
To relax the second assumption, a javascript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):
from slimit import ast # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma()) # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'
There is no need to treat the object literal (obj) as a json object. To get the necessary info, obj can be visited recursively like other ast nodes. That would make it possible to support arbitrary javascript code (that can be parsed by slimit).
Something like this may work:
import re
HTML = """
<html>
<head>
...
<script type= "text/javascript">
window.blog.data = {"activity":
{"type":"read"}
};
...
</script>
</head>
<body>
...
</body>
</html>
"""
# Raw string with escaped dots so that '.' only matches a literal dot
JSON = re.compile(r'window\.blog\.data = ({.*?});', re.DOTALL)
matches = JSON.search(HTML)
print(matches.group(1))
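A regex like this stops at the first "};", so it breaks if a string value inside the object happens to contain that sequence. If the literal is strict JSON, one stdlib-only alternative is to locate the assignment with a regex and let json.JSONDecoder.raw_decode() find where the value ends. A rough sketch, where extract_assigned_json and the "note" field are made up for the example:

```python
import json
import re


def extract_assigned_json(text, name):
    """Return the JSON value assigned to `name` in text, or None if absent."""
    # The trailing \s* matters: raw_decode does not skip leading whitespace.
    match = re.search(re.escape(name) + r"\s*=\s*", text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here: the ';' and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


page = """
<script type= "text/javascript">
window.blog.data = {"activity":
    {"type":"read", "note":"a }; inside a string"}
};
</script>
"""

data = extract_assigned_json(page, "window.blog.data")
print(data["activity"]["type"])
```

This still raises json.JSONDecodeError when the literal is javascript-only syntax, in which case a real javascript parser is the safer route.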