How to extract a JSON object that was defined in an HTML page's JavaScript block using Python?

I am downloading HTML pages that have data defined in them in the following way:

... <script type= "text/javascript">    window.blog.data = {"activity":{"type":"read"}}; </script> ...

I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)

Thanks

Edit: Would it be possible, and more correct, to do this with a Python headless browser (e.g., Ghost.py)?

638

asked Nov 10 '12 16:11

user971956

2 Answers

BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
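To illustrate the difference, here is a minimal sketch that uses only the standard json module. The sample literals and the helper name is_valid_json are made up for illustration; all four samples are legal JavaScript object literals, but only the first is valid JSON.

```python
import json

# Hypothetical samples: each is a legal JavaScript object literal,
# but only the first one is also valid JSON.
samples = {
    "valid JSON": '{"activity": {"type": "read"}}',
    "unquoted key": '{activity: {"type": "read"}}',
    "single quotes": "{'activity': {'type': 'read'}}",
    "trailing comma": '{"activity": {"type": "read"},}',
}


def is_valid_json(text):
    """Return True if json.loads() accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


results = {name: is_valid_json(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "parses" if ok else "rejected")
```

Only the first sample is accepted, which is why the json module alone is not enough for arbitrary JavaScript.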

In simple cases you could:

1. Extract the <script> element's text using an HTML parser.
2. Assume that the window.blog... assignment is on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex.
3. Assume that the extracted string is valid JSON and parse it using the json module.

Example:

#!/usr/bin/env python
import json
import re

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')
# Find the <script> whose text mentions the variable (string= replaces the older text= argument).
script = soup.find('script', string=re.compile(r'window\.blog\.data'))
# Capture the object literal between "=" and the ";" that ends the line.
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*(\{.*?\})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'

If the assumptions are incorrect then the code fails.

To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):

import json

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
from slimit import ast  # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor

# `html` is the same page text as in the previous example
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
# find the assignment whose left-hand side is window.blog.data
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'

There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like any other AST node. That approach would support arbitrary JavaScript code (as long as slimit can parse it).
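If a full JavaScript parser is more than the page needs, a possible middle ground (not part of the answer above) is json.JSONDecoder.raw_decode() from the standard library. It parses one JSON value starting at a given index and ignores whatever follows, so a ';' or '}' inside a string, or an object that spans several lines, does not confuse it. The literal must still be valid JSON. The helper name extract_json_after and the sample script text are hypothetical.

```python
import json
import re


def extract_json_after(pattern, text):
    """Return the JSON value that starts right after the first match of pattern."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError('assignment not found')
    # raw_decode() parses a single JSON value starting at the given index and
    # returns (value, end_index); anything after the value is ignored.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


# Hypothetical script text: the object spans two lines and contains
# ';' and '}' inside a string, which would trip up a simple regex.
script_text = '''
// ..
window.blog.data = {"note": "a ';' and a '}' inside a string",
                    "activity": {"type": "read"}};
var other = 1;
'''

# The pattern must consume the whitespace after '=' because raw_decode()
# expects the value to start exactly at the index it is given.
data = extract_json_after(r'window\.blog\.data\s*=\s*', script_text)
print(data['activity']['type'])
```

The script text would normally come from the HTML parser step, e.g. the string of the matching script element.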

170

answered Sep 17 '22 18:09

jfs

Something like this may work:

import re

HTML = """ 
<html>
    <head>
    ...
    <script type= "text/javascript"> 
window.blog.data = {"activity":
    {"type":"read"}
    };
    ...
    </script> 
    </head>
    <body>
    ...
    </body>
    </html>
"""

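# Escape the dots so they match literally; capture everything up to the first "};"
JSON = re.compile(r'window\.blog\.data = (\{.*?\});', re.DOTALL)

matches = JSON.search(HTML)

print(matches.group(1))  # the object literal, still as a string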
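The regex match is still a string. A minimal sketch of the remaining step, assuming the literal is valid JSON, passes it to json.loads() to get a Python dict. The page fragment and the name PATTERN are illustrative only, and the pattern here also tolerates variable whitespace around '='.

```python
import json
import re

# Hypothetical page fragment shaped like the one in the answer above.
HTML = """
<script type="text/javascript">
window.blog.data = {"activity":
    {"type":"read"}
    };
</script>
"""

PATTERN = re.compile(r'window\.blog\.data\s*=\s*(\{.*?\});', re.DOTALL)

match = PATTERN.search(HTML)
if match is None:
    raise ValueError('window.blog.data assignment not found')

# Convert the captured text into a Python object.
data = json.loads(match.group(1))
print(data['activity']['type'])
```

Checking for None before calling group() avoids an AttributeError on pages that do not contain the assignment.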

answered Sep 17 '22 18:09

Christian Thieme

Tags: python, html-parsing, beautifulsoup, headless-browser
