
Scrape inner frame HTML

I have a Python script that scrapes the src attribute of the <video> element in an HTML page. With the browser inspector open on the video of this page, I can see the video element I need to scrape, but viewing the page source directly only shows the Ember application's JavaScript files.

What do I need to do to access the "inner frame" markup that holds the <video> element so I can scrape the src attribute?

Edited so it's not so broad

O P asked Feb 08 '17 00:02


2 Answers

No need to go the full browser / selenium route. Just do a bit more investigation and you'll see how it works:

For the Vine URL https://vine.co/v/i3pQ70vK3iv, you want the JSON file that describes the video.

So simply fetch the URL https://archive.vine.co/posts/i3pQ70vK3iv.json. That will return a file like:

{
  "username": "Bleacher Report",
  "userIdStr": "906307026416705536",
  "postId": 1352573572862066700,
  "verified": 1,
  "description": "😳💯",
  "created": "2016-06-09T06:14:43.000000",
  "permalinkUrl": "https://vine.co/v/i3pQ70vK3iv",
  "userId": 906307026416705500,
  "profileBackground": "0x333333",
  "vanityUrls": [
    "BleacherReport"
  ],
  "entities": [],
  "postIdStr": "1352573572862066688",
  "comments": 293,
  "reposts": 2384,
  "videoLowURL": "http://mtc.cdn.vine.co/r/videos_r2/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=oVIxbcFKL5aaqsbMx_q.7wt4zEnhgQ0w",
  "loops": 19182516,
  "videoUrl": "http://mtc.cdn.vine.co/r/videos/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=av0W8OaLWSzghq.9__iKdSU4y75FDNg.",
  "videoDashUrl": "http://mtc.cdn.vine.co/r/videos_dashhd/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4?versionId=98zVYTYAx16DJka7Oa1yQu20utGrQch9",
  "thumbnailUrl": "http://v.cdn.vine.co/r/thumbs/DC69CF91B61352573549554077696_558739dd749.17.0.4126553130190094381.mp4.jpg?versionId=7LmJNEI3C6bsHkF3t9jqu5k1O2xEHo9l",
  "explicitContent": 0,
  "likes": 6593
}

You'll find the URL for the video itself in the videoUrl field of the returned JSON.
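As a rough sketch of that approach in Python, using only the standard library: the function names below are made up for illustration, and the code assumes the archive URL pattern shown above (the last path segment of the Vine URL plus `.json`) holds for other posts too, which this answer does not confirm.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: pattern inferred from the single example URL in the answer.
ARCHIVE_TEMPLATE = "https://archive.vine.co/posts/{post_id}.json"


def archive_json_url(vine_url):
    """Map https://vine.co/v/<id> to the matching archive JSON URL."""
    path = urlparse(vine_url).path  # e.g. "/v/i3pQ70vK3iv"
    post_id = path.rstrip("/").rsplit("/", 1)[-1]
    if not post_id:
        raise ValueError("no post id found in %r" % vine_url)
    return ARCHIVE_TEMPLATE.format(post_id=post_id)


def video_url_from_json(text):
    """Return the videoUrl field from the post's JSON text."""
    return json.loads(text)["videoUrl"]


def fetch_video_url(vine_url, timeout=10):
    """Download the post JSON and return its videoUrl (needs network access)."""
    with urlopen(archive_json_url(vine_url), timeout=timeout) as resp:
        return video_url_from_json(resp.read().decode("utf-8"))
```

Calling `fetch_video_url("https://vine.co/v/i3pQ70vK3iv")` should then return the videoUrl value, provided the archive endpoint still responds as shown.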

pbuck answered Oct 04 '22 02:10


JavaScript runs on the client to populate the video elements of the page, so you'll need a web driver to let the page fully populate before you can access the elements. You can try Selenium:

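from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # give client-side JS up to 10 s to add the element
try:
    driver.get("https://vine.co/v/i3pQ70vK3iv")
    # find_element_by_tag_name was removed in Selenium 4; use By.TAG_NAME
    video = driver.find_element(By.TAG_NAME, "video")
    print(video.get_attribute("src"))
finally:
    driver.quit()  # end the session and close the browser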

If you want to run the driver 'headless' (without a GUI), see "Is it possible to run selenium (Firefox) web driver without a GUI?"
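One possible headless variant, as a sketch rather than the linked question's answer: it assumes a recent Selenium with Firefox and geckodriver installed, and the function name `video_src_headless` is hypothetical. It also assumes the src sits on the <video> element itself, as in the question.

```python
def video_src_headless(url, timeout=10):
    """Open url in headless Firefox and return the src of the first <video>."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # Firefox command-line flag: no GUI window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait until client-side JavaScript has inserted the <video> element.
        video = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "video"))
        )
        return video.get_attribute("src")
    finally:
        driver.quit()  # always end the browser session
```

The explicit wait is used here instead of a fixed sleep so the call returns as soon as the element appears and raises a timeout error if it never does.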

TheoretiCAL answered Oct 04 '22 00:10