
How to parse ld+json using python

I've been trying some web scraping and came across some interesting data inside this tag:

<script type="application/ld+json">

I've been able to isolate that tag using Beautiful Soup:

html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

p = soup.find('script', {'type':'application/ld+json'})
print p

But I haven't been able to work with the data or extract anything from that tag.

If I try to use a regex to pull values out of it, I get:

TypeError: expected string or buffer

How can I get the data from that script tag and use it like I'd use a dictionary or a string? I'm using Python 2.7, by the way.

wessells asked Apr 27 '17 10:04


2 Answers

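You should read the HTML before parsing it, then use .contents to get what is inside the tag. Note that .contents is a list of the tag's child nodes, not a single string: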

html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
p = soup.find('script', {'type':'application/ld+json'})
print p.contents
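
The TypeError in the question most likely comes from handing the tag object itself to the regex functions, which only accept strings. As a minimal sketch (the HTML fragment and the "name" field are hypothetical, made up for illustration), joining .contents gives a plain string that re can work with:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page fragment, used only for illustration.
html = '<script type="application/ld+json">{"name": "Widget"}</script>'

soup = BeautifulSoup(html, "html.parser")
p = soup.find("script", {"type": "application/ld+json"})

# p is a Tag and p.contents is a list of child nodes,
# so neither can be passed to the re functions directly.
raw = "".join(p.contents)  # the plain text inside the tag

# With a real string, a regex search works as usual.
match = re.search(r'"name":\s*"([^"]+)"', raw)
print(match.group(1))
```

For anything beyond a quick check, parsing the string as JSON is usually more reliable than matching it with a regex.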
Pavan Kumar T S answered Sep 23 '22 02:09


You should read the JSON with json.loads to convert it into a dictionary. Note that the function below uses type annotations, so it needs Python 3:

import json

import requests
from bs4 import BeautifulSoup

def get_ld_json(url: str) -> dict:
    parser = "html.parser"
    req = requests.get(url)
    soup = BeautifulSoup(req.text, parser)
    return json.loads("".join(soup.find("script", {"type":"application/ld+json"}).contents))

Joining the tag's .contents yields only the text inside the script element, without the surrounding <script> tags, which is the string json.loads needs.
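
A page can contain several ld+json blocks, or none at all, in which case soup.find returns None and the one-liner above fails. The following is a rough sketch of a more defensive variant, not part of the original answer: the function name extract_ld_json and the sample HTML are hypothetical, and it works on an HTML string so it can be tried without a network request.

```python
import json

from bs4 import BeautifulSoup


def extract_ld_json(html: str) -> list:
    """Return the parsed content of every ld+json script block in html."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        # Same join/contents idea as above: keep only the text inside the tag.
        text = "".join(tag.contents).strip()
        if not text:
            continue  # skip empty script blocks
        try:
            results.append(json.loads(text))
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            continue  # skip blocks that are not valid JSON
    return results


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
<script type="application/ld+json">[{"@type": "Person", "name": "Ada"}]</script>
</head><body></body></html>
"""

blocks = extract_ld_json(SAMPLE_HTML)
print(blocks[0]["name"])
```

Each entry in the returned list is whatever the block decodes to, which may be a dict or a list, so callers should check the type before indexing into it.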

Mark Chackerian answered Sep 26 '22 02:09