In its HTML, Vine includes a <script type="application/ld+json"> tag with links to all the videos on the page. How would I go about accessing this JSON?
import requests
from bs4 import BeautifulSoup
url = 'https://vine.co/tags/funny'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
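You can use a CSS selector (quote the attribute value, since it contains / and +, which newer Beautiful Soup versions reject when unquoted):
soup.select('script[type="application/ld+json"]')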
Or use find_all with type="application/ld+json":
soup.find_all("script",type="application/ld+json")
Both give you:
[<script type="application/ld+json">\n {\n "@context": "http://schema.org",\n "@type": "ItemList",\n "url": "https://vine.co/tags/funny",\n "itemListElement": [\n \n {\n "@type": "ListItem",\n "position": 1,\n "url": "https://vine.co/v/iLKgAXeqwqu"\n },\n \n {\n "@type": "ListItem",\n "position": 2,\n "url": "https://vine.co/v/iLK6p2UHDTl"\n },\n \n {\n "@type": "ListItem",\n "position": 3,\n "url": "https://vine.co/v/iLKrbIeXPTH"\n },\n \n {\n "@type": "ListItem",\n "position": 4,\n "url": "https://vine.co/v/iLKrbZ5zir0"\n },\n \n {\n "@type": "ListItem",\n "position": 5,\n "url": "https://vine.co/v/iLKvxUwLUxr"\n },\n \n {\n "@type": "ListItem",\n "position": 6,\n "url": "https://vine.co/v/iLKvnVOd7VA"\n },\n \n {\n "@type": "ListItem",\n "position": 7,\n "url": "https://vine.co/v/iLKv73UQmjB"\n },\n \n {\n "@type": "ListItem",\n "position": 8,\n "url": "https://vine.co/v/iLKvBeO9Fmt"\n },\n \n {\n "@type": "ListItem",\n "position": 9,\n "url": "https://vine.co/v/iLKnrqMDYeD"\n },\n \n {\n "@type": "ListItem",\n "position": 10,\n "url": "https://vine.co/v/iLKnWrjMqwE"\n },\n \n {\n "@type": "ListItem",\n "position": 11,\n "url": "https://vine.co/v/iLK17Bg1wt0"\n },\n \n {\n "@type": "ListItem",\n "position": 12,\n "url": "https://vine.co/v/iLK5ExAZ7WB"\n },\n \n {\n "@type": "ListItem",\n "position": 13,\n "url": "https://vine.co/v/iLK5Eg7vHM7"\n },\n \n {\n "@type": "ListItem",\n "position": 14,\n "url": "https://vine.co/v/iLKitbix3pb"\n },\n \n {\n "@type": "ListItem",\n "position": 15,\n "url": "https://vine.co/v/iLKOleYJhUp"\n },\n \n {\n "@type": "ListItem",\n "position": 16,\n "url": "https://vine.co/v/iLKOTFgXVFQ"\n },\n \n {\n "@type": "ListItem",\n "position": 17,\n "url": "https://vine.co/v/iLKMI6t91xe"\n },\n \n {\n "@type": "ListItem",\n "position": 18,\n "url": "https://vine.co/v/iLKMX6p0TD6"\n },\n \n {\n "@type": "ListItem",\n "position": 19,\n "url": "https://vine.co/v/iLKM6Hh1nzr"\n },\n \n {\n "@type": "ListItem",\n "position": 20,\n "url": 
"https://vine.co/v/iLKhQWVIAj3"\n }\n \n ]\n }\n </script>]
To turn it into a Python object, pass the tag's contents to json.loads. Since there is only one matching tag, you can use select_one or find instead:
import requests
from bs4 import BeautifulSoup
import json
url = 'https://vine.co/tags/funny'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
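# Alternative using find:
# js = json.loads(soup.find("script", type="application/ld+json").string)
# .string is used because .text can come back empty for <script> tags on some Beautiful Soup versions
js = json.loads(soup.select_one('script[type="application/ld+json"]').string)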
print(js)
Which gives you:
{u'url': u'https://vine.co/tags/funny', u'@context': u'http://schema.org', u'itemListElement': [{u'url': u'https://vine.co/v/iLKgAXeqwqu', u'position': 1, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK6p2UHDTl', u'position': 2, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKrbIeXPTH', u'position': 3, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKrbZ5zir0', u'position': 4, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvxUwLUxr', u'position': 5, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvnVOd7VA', u'position': 6, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKv73UQmjB', u'position': 7, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKvBeO9Fmt', u'position': 8, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKnrqMDYeD', u'position': 9, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKnWrjMqwE', u'position': 10, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK17Bg1wt0', u'position': 11, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK5ExAZ7WB', u'position': 12, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLK5Eg7vHM7', u'position': 13, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKitbix3pb', u'position': 14, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKOleYJhUp', u'position': 15, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKOTFgXVFQ', u'position': 16, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKMI6t91xe', u'position': 17, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKMX6p0TD6', u'position': 18, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKM6Hh1nzr', u'position': 19, u'@type': u'ListItem'}, {u'url': u'https://vine.co/v/iLKhQWVIAj3', u'position': 20, u'@type': u'ListItem'}], u'@type': u'ItemList'}
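One caveat with the snippet above: find and select_one return None when nothing matches, and json.loads raises on malformed input, so a page change can crash the script. Here is a minimal sketch of a more defensive version. The helper name load_ld_json and the inline sample_html are made up for illustration and are not part of the original page or of Beautiful Soup.

```python
import json
from bs4 import BeautifulSoup


def load_ld_json(html):
    """Return a list of parsed JSON-LD objects found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = []
    for tag in soup.find_all("script", type="application/ld+json"):
        raw = tag.string  # None when the tag is empty
        if not raw:
            continue
        try:
            parsed.append(json.loads(raw))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue  # skip blocks that are not valid JSON
    return parsed


# Hypothetical sample input, trimmed down from the output shown earlier.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList", "itemListElement": [
  {"@type": "ListItem", "position": 1, "url": "https://vine.co/v/iLKgAXeqwqu"}
]}
</script>
<script type="application/ld+json">not valid json</script>
<script>var x = 1;</script>
</head></html>
"""

blocks = load_ld_json(sample_html)
print(len(blocks))
```

It returns a list rather than a single dict because a page may carry several ld+json blocks; with the Vine page as shown here you would take blocks[0].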
The last step is to pull the URLs out of js. They are in a list of dicts that you can access with js["itemListElement"]:
In [18]: js = json.loads(soup.select_one('script[type="application/ld+json"]').string)
In [19]: all_urls = [dct["url"] for dct in js["itemListElement"]]
In [20]: print(all_urls)
['https://vine.co/v/iLK2rbzBU50', 'https://vine.co/v/iLK2iw305nH', 'https://vine.co/v/iLK2AadMMTO', 'https://vine.co/v/iLK2WY1EMWJ', 'https://vine.co/v/iLKQ6AdTtXE', 'https://vine.co/v/iLKQAPtKdwF', 'https://vine.co/v/iLKQAKpVJAM', 'https://vine.co/v/iLKxQqIH65I', 'https://vine.co/v/iLKxAuJwe2v', 'https://vine.co/v/iLKPQhZprq3', 'https://vine.co/v/iLKPIij7EzW', 'https://vine.co/v/iLKU697X3iQ', 'https://vine.co/v/iLKFZDTUHla', 'https://vine.co/v/iLKtPzahtel', 'https://vine.co/v/iLKTbpb1hgO', 'https://vine.co/v/iLKTaKYEx06', 'https://vine.co/v/iLKInbjuAnY', 'https://vine.co/v/iLKIBDbbDHY', 'https://vine.co/v/iLKjPxPz7bK', 'https://vine.co/v/iLKjFzKJwYF']
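The list comprehension assumes every entry has a "url" key and that the entries already arrive in display order. If you would rather not rely on that, something along these lines may help. It is only a sketch: urls_in_order is a hypothetical helper, and the sample dict is a shortened, reordered copy of the structure shown earlier.

```python
def urls_in_order(item_list):
    """Return the entry URLs of a schema.org ItemList dict, sorted by position."""
    entries = item_list.get("itemListElement", [])
    # Keep only dict entries that actually carry a URL.
    with_url = [e for e in entries if isinstance(e, dict) and "url" in e]
    # Entries without a position sort first (treated as position 0).
    with_url.sort(key=lambda e: e.get("position", 0))
    return [e["url"] for e in with_url]


# Hypothetical sample: same shape as js above, deliberately out of order.
sample_js = {
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "position": 2, "url": "https://vine.co/v/iLK6p2UHDTl"},
        {"@type": "ListItem", "position": 3},
        {"@type": "ListItem", "position": 1, "url": "https://vine.co/v/iLKgAXeqwqu"},
    ],
}

ordered = urls_in_order(sample_js)
print(ordered)
```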