How to parse ld+json using Python

Question

I've been trying some web scraping and came across some interesting data inside this tag:

<script type="application/ld+json">

I've been able to isolate that tag using Beautiful Soup:

html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

p = soup.find('script', {'type':'application/ld+json'})
print p

but I haven't been able to work with the data or extract anything from that tag.

If I try to use a regex to pull something out of it, I get:

TypeError: expected string or buffer

How can I get the data from that script tag and use it like I'd use a dictionary or a string? I'm using Python 2.7, by the way.

Pavan Kumar T S · Accepted Answer

Read the response body into a string before parsing it, then use the tag's contents to get at the text inside the script element:

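from urllib2 import urlopen  # Python 2.7, as in the question
from bs4 import BeautifulSoup

html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
p = soup.find('script', {'type':'application/ld+json'})
print p.contents  # a list holding the text inside the script tag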
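To illustrate why the regex call in the question fails, here is a minimal sketch. It assumes Python 3 and bs4, and the HTML snippet and the "Example Widget" data are made up for the example. find returns a bs4 Tag object rather than a string, which is what the re functions complain about; joining the tag's contents gives plain text that both re and json can work with.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page content, standing in for the downloaded HTML.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Widget"}
</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", {"type": "application/ld+json"})

# tag is a bs4 Tag, not a string, so passing it to re raises a TypeError.
# Joining its contents gives the plain text inside the element.
text = "".join(tag.contents)

# The text can now be searched with a regex...
match = re.search(r'"name":\s*"([^"]+)"', text)
name_from_regex = match.group(1)

# ...or, more robustly, parsed as JSON into a dictionary.
data = json.loads(text)
```

Parsing with json is usually the safer of the two, since a regex breaks as soon as the key order or whitespace in the payload changes.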

Mark Chackerian · Answer

You should read the JSON with json.loads to convert it into a dictionary.

import json

import requests
from bs4 import BeautifulSoup

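def get_ld_json(url: str) -> dict:
    parser = "html.parser"
    req = requests.get(url)
    soup = BeautifulSoup(req.text, parser)
    # find returns the first matching tag; its contents is a list of strings
    script = soup.find("script", {"type": "application/ld+json"})
    # joining the contents yields the raw JSON text for json.loads
    return json.loads("".join(script.contents))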

The join / contents combination yields only the text inside the element, without the surrounding script tags.
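A page can contain several ld+json blocks, or none, and a block may hold a JSON array rather than an object. The following is a hedged sketch of one way to handle that, not part of the answer above: extract_ld_json is a hypothetical helper name, the sample HTML is invented, and it takes already-downloaded HTML so it can be tried without a network request.

```python
import json

from bs4 import BeautifulSoup


def extract_ld_json(html: str) -> list:
    """Return the parsed payload of every ld+json script tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        raw = "".join(tag.contents).strip()
        if not raw:
            continue  # empty script tag, nothing to parse
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
    return results


# Made-up sample with two valid blocks, one unrelated script, one broken block.
sample = """
<script type="application/ld+json">{"@type": "Organization", "name": "Example Co"}</script>
<script type="application/ld+json">[{"@type": "Person"}]</script>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">not json</script>
"""
blocks = extract_ld_json(sample)
```

With requests, this could be called as extract_ld_json(requests.get(url).text). Whether to skip invalid blocks silently, as done here, or raise an error is a design choice that depends on how much you trust the pages being scraped.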

Tags:

python

json

web-scraping

json-ld

wessells
