
Parsing HTML for a specific script type

In its HTML, Vine includes a <script type="application/ld+json"> element with links to all the videos on the page. How would I go about accessing this JSON?

import requests
from bs4 import BeautifulSoup

url = 'https://vine.co/tags/funny'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
user6162407 asked Jun 11 '16 19:06




1 Answer

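You can use a CSS selector. Quote the attribute value, since it contains / and +, which are not valid in an unquoted CSS value:

soup.select('script[type="application/ld+json"]')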

Or use find_all, passing type="application/ld+json":

soup.find_all("script",type="application/ld+json")

Both give you:

[<script type="application/ld+json">\n          {\n            "@context": "http://schema.org",\n            "@type": "ItemList",\n            "url": "https://vine.co/tags/funny",\n            "itemListElement": [\n              \n              {\n                "@type": "ListItem",\n                "position": 1,\n                "url": "https://vine.co/v/iLKgAXeqwqu"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 2,\n                "url": "https://vine.co/v/iLK6p2UHDTl"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 3,\n                "url": "https://vine.co/v/iLKrbIeXPTH"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 4,\n                "url": "https://vine.co/v/iLKrbZ5zir0"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 5,\n                "url": "https://vine.co/v/iLKvxUwLUxr"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 6,\n                "url": "https://vine.co/v/iLKvnVOd7VA"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 7,\n                "url": "https://vine.co/v/iLKv73UQmjB"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 8,\n                "url": "https://vine.co/v/iLKvBeO9Fmt"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 9,\n                "url": "https://vine.co/v/iLKnrqMDYeD"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 10,\n                "url": "https://vine.co/v/iLKnWrjMqwE"\n              },\n              
\n              {\n                "@type": "ListItem",\n                "position": 11,\n                "url": "https://vine.co/v/iLK17Bg1wt0"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 12,\n                "url": "https://vine.co/v/iLK5ExAZ7WB"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 13,\n                "url": "https://vine.co/v/iLK5Eg7vHM7"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 14,\n                "url": "https://vine.co/v/iLKitbix3pb"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 15,\n                "url": "https://vine.co/v/iLKOleYJhUp"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 16,\n                "url": "https://vine.co/v/iLKOTFgXVFQ"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 17,\n                "url": "https://vine.co/v/iLKMI6t91xe"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 18,\n                "url": "https://vine.co/v/iLKMX6p0TD6"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 19,\n                "url": "https://vine.co/v/iLKM6Hh1nzr"\n              },\n              \n              {\n                "@type": "ListItem",\n                "position": 20,\n                "url": "https://vine.co/v/iLKhQWVIAj3"\n              }\n              \n            ]\n          }\n        </script>]
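To check that the two lookups match the same element without hitting the network, here is a minimal sketch run against a small inline document. The HTML string is a made-up stand-in for the Vine page, not its real markup. The selector quotes the attribute value because newer BeautifulSoup releases, which hand CSS selectors to soupsieve, reject the unquoted form.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one JSON-LD block plus an unrelated script.
html = """
<html><head>
<script type="text/javascript">var x = 1;</script>
<script type="application/ld+json">{"@type": "ItemList", "itemListElement": []}</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector; the value is quoted because it contains "/" and "+".
by_select = soup.select('script[type="application/ld+json"]')

# Keyword-argument filter on the same attribute.
by_find_all = soup.find_all("script", type="application/ld+json")

# Both calls return the same Tag object from the parse tree.
print(len(by_select), len(by_find_all))
print(by_select[0] is by_find_all[0])
```

The unrelated text/javascript script is skipped by both calls, so either form is safe on pages that mix several script types.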

To turn it into Python data, pass the script's text to json.loads. Since there is only one such element, you can use select_one or find instead of the list-returning calls:

import json

import requests
from bs4 import BeautifulSoup

url = 'https://vine.co/tags/funny'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Equivalent lookup with find:
# js = json.loads(soup.find("script", type="application/ld+json").text)
js = json.loads(soup.select_one('script[type="application/ld+json"]').text)
print(js)

Which gives you:

{u'url': u'https://vine.co/tags/funny', u'@context': u'http://schema.org', u'itemListElement': [{u'url': u'https://vine.co/v/iLKgAXeqwqu', u'position': 1, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK6p2UHDTl', u'position': 2, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKrbIeXPTH', u'position': 3, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKrbZ5zir0', u'position': 4, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvxUwLUxr', u'position': 5, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvnVOd7VA', u'position': 6, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKv73UQmjB', u'position': 7, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvBeO9Fmt', u'position': 8, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKnrqMDYeD', u'position': 9, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKnWrjMqwE', u'position': 10, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK17Bg1wt0', u'position': 11, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK5ExAZ7WB', u'position': 12, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK5Eg7vHM7', u'position': 13, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKitbix3pb', u'position': 14, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKOleYJhUp', u'position': 15, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKOTFgXVFQ', u'position': 16, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKMI6t91xe', u'position': 17, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKMX6p0TD6', u'position': 18, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKM6Hh1nzr', u'position': 19, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKhQWVIAj3', u'position': 20, u'@type': u'ListItem'}], u'@type': u'ItemList'}
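The one-liner above assumes the page carries exactly one JSON-LD block and that it parses cleanly. If either assumption might not hold, a more defensive sketch could look like the following. The helper name load_json_ld and the sample markup are illustrative, not part of the original answer or of any library.

```python
import json

from bs4 import BeautifulSoup


def load_json_ld(html):
    """Return every JSON-LD payload in html that parses as valid JSON."""
    soup = BeautifulSoup(html, "html.parser")
    payloads = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            # tag.string is None for an empty script, so fall back to "".
            payloads.append(json.loads(tag.string or ""))
        except ValueError:
            # json.JSONDecodeError subclasses ValueError; skip malformed blocks.
            continue
    return payloads


# Hypothetical sample: one valid block and one truncated block.
sample = """
<script type="application/ld+json">{"@type": "ItemList", "url": "https://vine.co/tags/funny"}</script>
<script type="application/ld+json">{"@type": </script>
"""

# Keep only dict payloads whose @type is ItemList.
item_lists = [
    p for p in load_json_ld(sample)
    if isinstance(p, dict) and p.get("@type") == "ItemList"
]
print(item_lists)
```

Filtering on "@type" keeps the code working if the page later adds other structured-data blocks next to the item list.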

The last step is to pull the URLs out of js. They are in a list of dicts that you can access with js["itemListElement"]:

In [18]: js = json.loads(soup.select_one('script[type="application/ld+json"]').text)

In [19]: all_urls = [dct["url"] for dct in js["itemListElement"]]

In [20]: print(all_urls)
['https://vine.co/v/iLK2rbzBU50', 'https://vine.co/v/iLK2iw305nH', 'https://vine.co/v/iLK2AadMMTO', 'https://vine.co/v/iLK2WY1EMWJ', 'https://vine.co/v/iLKQ6AdTtXE', 'https://vine.co/v/iLKQAPtKdwF', 'https://vine.co/v/iLKQAKpVJAM', 'https://vine.co/v/iLKxQqIH65I', 'https://vine.co/v/iLKxAuJwe2v', 'https://vine.co/v/iLKPQhZprq3', 'https://vine.co/v/iLKPIij7EzW', 'https://vine.co/v/iLKU697X3iQ', 'https://vine.co/v/iLKFZDTUHla', 'https://vine.co/v/iLKtPzahtel', 'https://vine.co/v/iLKTbpb1hgO', 'https://vine.co/v/iLKTaKYEx06', 'https://vine.co/v/iLKInbjuAnY', 'https://vine.co/v/iLKIBDbbDHY', 'https://vine.co/v/iLKjPxPz7bK', 'https://vine.co/v/iLKjFzKJwYF']
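The comprehension above assumes every entry has a "url" key and arrives in page order. A slightly more forgiving sketch, run here against a hand-written payload (the out-of-order entries and the entry without a URL are made up for illustration), sorts by "position" and skips incomplete items:

```python
import json

# Hypothetical payload shaped like the real one, with entries out of order
# and one entry missing its "url" key.
raw = """
{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 2, "url": "https://vine.co/v/iLK6p2UHDTl"},
    {"@type": "ListItem", "position": 3},
    {"@type": "ListItem", "position": 1, "url": "https://vine.co/v/iLKgAXeqwqu"}
  ]
}
"""

js = json.loads(raw)

# Sort by "position" so the result follows page order, then keep entries with a URL.
items = sorted(js.get("itemListElement", []), key=lambda d: d.get("position", 0))
all_urls = [d["url"] for d in items if "url" in d]

print(all_urls)
# ['https://vine.co/v/iLKgAXeqwqu', 'https://vine.co/v/iLK6p2UHDTl']
```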
Padraic Cunningham answered Sep 30 '22 22:09
