I have a Python script that scrapes the src attribute of the <video> element in an HTML page. With the browser inspector open on the video of this page, I can see the <video> element I need to scrape, but viewing the page source directly only shows the Ember application's JavaScript files.
What do I need to do to access the "inner frame" markup that holds the <video> element so I can scrape the src attribute?
Edited so it's not so broad
No need to go the full browser / Selenium route. Just do a bit more investigation and you'll see how it works:
For the Vine URL https://vine.co/v/i3pQ70vK3iv, you want the JSON file that describes the video.
So simply scrape the URL https://archive.vine.co/posts/i3pQ70vK3iv.json. That will return a file like:
{
"username": "Bleacher Report",
"userIdStr": "906307026416705536",
"postId": 1352573572862066700,
"verified": 1,
"description": "😳💯",
"created": "2016-06-09T06:14:43.000000",
"permalinkUrl": "https://vine.co/v/i3pQ70vK3iv",
"userId": 906307026416705500,
"profileBackground": "0x333333",
"vanityUrls": [
"BleacherReport"
],
"entities": [],
"postIdStr": "1352573572862066688",
"comments": 293,
"reposts": 2384,
"videoLowURL": "http://mtc.cdn.vine.co/r/videos_r2/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=oVIxbcFKL5aaqsbMx_q.7wt4zEnhgQ0w",
"loops": 19182516,
"videoUrl": "http://mtc.cdn.vine.co/r/videos/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=av0W8OaLWSzghq.9__iKdSU4y75FDNg.",
"videoDashUrl": "http://mtc.cdn.vine.co/r/videos_dashhd/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=98zVYTYAx16DJka7Oa1yQu20utGrQch9",
"thumbnailUrl": "http://v.cdn.vine.co/r/thumbs/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4.jpg?versionId=7LmJNEI3C6bsHkF3t9jqu5k1O2xEHo9l",
"explicitContent": 0,
"likes": 6593
}
You'll find the URL for the video itself as the videoUrl attribute in the returned JSON.
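As a rough illustration of that approach, here is a minimal sketch using only the Python standard library. It assumes the archive URL pattern shown above holds for other posts and that the endpoint still responds; the helper names (post_id_from_url, archive_json_url, fetch_video_url) are made up for this example and are not part of any Vine API.

```python
import json
import re
import urllib.request

# Assumption: every post follows the pattern seen in the example above.
ARCHIVE_JSON = "https://archive.vine.co/posts/{post_id}.json"


def post_id_from_url(vine_url):
    """Return the short post ID from a URL like https://vine.co/v/i3pQ70vK3iv."""
    match = re.search(r"/v/([A-Za-z0-9]+)", vine_url)
    if match is None:
        raise ValueError("not a recognisable Vine post URL: %r" % vine_url)
    return match.group(1)


def archive_json_url(vine_url):
    """Build the JSON URL that describes the post."""
    return ARCHIVE_JSON.format(post_id=post_id_from_url(vine_url))


def video_url_from_payload(payload):
    """Pick the video URL out of the decoded JSON dictionary."""
    return payload["videoUrl"]


def fetch_video_url(vine_url, timeout=10):
    """Download the JSON description and return its videoUrl field."""
    with urllib.request.urlopen(archive_json_url(vine_url), timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return video_url_from_payload(payload)
```

Calling fetch_video_url("https://vine.co/v/i3pQ70vK3iv") would perform the network request; the other helpers are pure functions and can be checked offline.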
JavaScript runs on the client to populate the video elements of the page, so you'll need a web driver to let the page fully populate before you can access the elements. You can try Selenium:
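from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://vine.co/v/i3pQ70vK3iv")
# The <video> element only exists after the page's JavaScript has run.
video = driver.find_element(By.TAG_NAME, "video")
print(video.get_attribute("src"))
driver.quit()  # close the browser and end the driver session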
If you want to run the driver "headless" (without a GUI), see "Is it possible to run selenium (Firefox) web driver without a GUI?"
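Once a driver has rendered the page, the resulting markup (for example the string Selenium exposes as driver.page_source) can also be searched without further browser calls. The sketch below uses the standard library's html.parser to collect src values from <video> elements and any <source> children. The class and function names are hypothetical, and it assumes the rendered markup carries the URL in a plain src attribute.

```python
from html.parser import HTMLParser


class VideoSrcParser(HTMLParser):
    """Collect src attributes from <video> tags and nested <source> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._in_video = True
            if attrs.get("src"):
                self.sources.append(attrs["src"])
        elif tag == "source" and self._in_video and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def video_sources(html):
    """Return every video URL found in the given HTML string, in document order."""
    parser = VideoSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources
```

For instance, video_sources(driver.page_source) would be one way to feed it the rendered page; on the raw page source, which only contains the application's script tags, it would find nothing.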