I've been trying some web scraping and came across some interesting data inside this tag:
<script type="application/ld+json">
I've been able to isolate that tag using Beautiful Soup:
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
p = soup.find('script', {'type':'application/ld+json'})
print p
but I haven't been able to work with the data or extract anything from that tag.
If I try to use regex to get some stuff out of it I get:
TypeError: expected string or buffer
How can I get the data from that script tag and use it like I'd use a dictionary or a string? I'm using Python 2.7, by the way.
You should read the HTML before parsing it, then look at the tag's contents:
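# Imports assume Python 2.7, as stated in the question.
from urllib2 import urlopen
from bs4 import BeautifulSoup

html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
p = soup.find('script', {'type':'application/ld+json'})
print p.contents  # a list holding the text inside the script tag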
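For context on the TypeError in the question: find returns a bs4 Tag object rather than a string, so passing it straight to the re functions fails. Below is a minimal sketch of the difference, assuming Python 3 (unlike the 2.7 snippets above) and a made-up HTML snippet standing in for the real page. The sample markup and the variable names are illustrative only, not taken from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content, used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tag = soup.find("script", {"type": "application/ld+json"})

# `tag` is a bs4 Tag, not a str, which is why the re functions reject it.
tag_is_str = isinstance(tag, str)

# `contents` is a list of the tag's children; joining them yields plain text.
raw_text = "".join(tag.contents)

# Once the data is a plain string, regex (or json.loads) works as usual.
match = re.search(r'"name":\s*"([^"]+)"', raw_text)
product_name = match.group(1) if match else None
print(product_name)
```

Since the payload is JSON, parsing it with json.loads is usually less fragile than pulling fields out with a regex.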
You should read the JSON with json.loads to convert it into a dictionary.
import json

import requests
from bs4 import BeautifulSoup


def get_ld_json(url: str) -> dict:
    parser = "html.parser"
    req = requests.get(url)
    soup = BeautifulSoup(req.text, parser)
    return json.loads("".join(soup.find("script", {"type": "application/ld+json"}).contents))
The join/contents combination removes the script tags, leaving only the JSON text for json.loads to parse.
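One caveat with the one-liner: find returns None when the page has no matching tag, so the .contents access would raise AttributeError, and pages can also carry more than one JSON-LD block. A possible, more defensive variant is sketched below. It assumes Python 3; extract_ld_json is a hypothetical helper name, and it takes the HTML as a string so it can be tried without a network request.

```python
import json
from bs4 import BeautifulSoup


def extract_ld_json(html: str) -> list:
    """Return the parsed content of every JSON-LD script block in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", {"type": "application/ld+json"}):
        raw = "".join(tag.contents).strip()
        if not raw:
            continue  # skip empty script tags
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip blocks that do not contain valid JSON
    return results


# Hypothetical markup with two JSON-LD blocks, used only for illustration.
SAMPLE_PAGE = """
<script type="application/ld+json">{"@type": "Product", "name": "Widget"}</script>
<script type="application/ld+json">{"@type": "Organization", "name": "Acme"}</script>
"""

blocks = extract_ld_json(SAMPLE_PAGE)
print([block["@type"] for block in blocks])
```

With requests, the same helper could be called as extract_ld_json(requests.get(url).text). Returning a list means a page without any JSON-LD gives an empty result rather than an exception.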