
Can't scrape YouTube video's closed captions

I'm trying to scrape a YouTube page for the subtitles. Unfortunately, the response to my request doesn't contain everything the page shows in the browser. I'm curious to know where I went wrong.

Query string:

https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=Nxb2s2Mv6Pw&lang=en&bl=vmp&forceedit=captions&tab=captions

So I've figured out that Nxb2s2Mv6Pw is the unique video ID in the URL, and I can substitute it accordingly for other videos.
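That substitution could be sketched as follows, using only the standard library. The helper names `video_id` and `watch_url` are hypothetical and chosen for illustration; they are not part of any YouTube API.

```python
from urllib.parse import urlparse, parse_qs, urlencode

def video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None

def watch_url(vid):
    """Build a regular watch URL for a video ID."""
    return "https://www.youtube.com/watch?" + urlencode({"v": vid})

editor = ("https://www.youtube.com/timedtext_editor?action_mde_edit_form=1"
          "&v=Nxb2s2Mv6Pw&lang=en&bl=vmp&forceedit=captions&tab=captions")
print(video_id(editor))
print(watch_url(video_id(editor)))
```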

If I run the code below, it doesn't find the <textarea yt-uix-form-input-textarea ...> tag that I need to locate.

I'm desperately trying to avoid using Selenium to capture this, as I've got a lot of links I need to iterate through and repeat the process for. As you can tell from the code below, I've tried adding a delay to wait for the page to load, but it made no difference.

import os
import codecs
import sys
import time  # needed for the time.sleep() calls below
import requests
from bs4 import BeautifulSoup

channel = 'https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=dto4koj5DTA&lang=en'
s = requests.Session()
time.sleep(5)
# s.headers['User-Agent'] = USER_AGENT
r = s.get(channel)
time.sleep(5)
html = r.text
soup = BeautifulSoup(html, 'lxml')

for i in soup.find_all('div'):
    print(i)

Please advise.

M4cJunk13 asked Jan 06 '18 07:01



1 Answer

I tried scraping the page using requests and lxml, but when iterating over the tags I could find no subtitles on the page: the textarea tag that holds the subtitles does not show up in the output. This is likely because YouTube uses JavaScript to load the subtitles.
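One way to confirm this is to check whether the tag exists at all in the raw HTML, before any JavaScript has run. Here is a minimal sketch using only the standard library. The `sample` markup and the `buildEditor()` call in it are made up for illustration and are not YouTube's actual response.

```python
from html.parser import HTMLParser

class TagFinder(HTMLParser):
    """Records whether a given tag name appears in the parsed HTML."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.found = True

def has_tag(html, tag):
    """Return True if `tag` occurs as an element in the static HTML."""
    finder = TagFinder(tag)
    finder.feed(html)
    return finder.found

# Made-up markup: the textarea would only be created by the script at runtime.
sample = ("<html><body><div id='editor'></div>"
          "<script>buildEditor()</script></body></html>")
print(has_tag(sample, "textarea"))
print(has_tag(sample, "div"))
```

To try it against the real page, pass `r.text` from the code in the question as `html`.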

Python's requests library does not execute JavaScript. You do, however, have a few options:

  • Use selenium for scraping the subtitles (You said you would rather not do this.)

  • Inspect the GET and POST requests your browser makes, find the URL the JavaScript fetches the subtitles from, and send the required parameters to that URL yourself. (This may not work if authentication or dynamically generated tokens are required.)

  • Use youtube-dl to download the subtitles.

    (This seems to be the easiest / most reliable way to go about this.)

youtube-dl is a command-line utility, but according to the docs on GitHub you can also import it as a Python module.

There are a couple of ways you could go about this. I will be using the video you pointed to in your post for my examples:

youtube-dl --write-sub --skip-download --sub-lang en "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"

You can also wrap that command in a Python function:

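import subprocess

def download_subs(video_url, lang="en"):
    cmd = [
        "youtube-dl",
        "--skip-download",
        "--write-sub",
        "--sub-lang",
        lang,
        video_url
    ]

    # Pass the argument list directly (no shell), so the "?" and "&" in
    # the URL are not interpreted by the shell.
    subprocess.run(cmd, check=True)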


url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"

download_subs(url)

Alternatively, you could import youtube_dl in Python directly and use it from there:

import youtube_dl

def download_subs(url, lang="en"):
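    opts = {
        "skip_download": True,      # fetch subtitles only, not the video
        "writesubtitles": True,     # boolean flag, not a filename template
        "subtitleslangs": [lang],   # list of language codes
        "subtitlesformat": "vtt"
    }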

    with youtube_dl.YoutubeDL(opts) as yt:
        yt.download([url])

url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
download_subs(url)

This creates a file in the working directory named

CNN 'Exposed' In Controversial Secret Video and Anita Sarkeesian's 'Punishment'...-Nxb2s2Mv6Pw.en.vtt

The contents of the file look something like this:

WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.500
You beautiful bastards

00:00:01.500 --> 00:00:07.200
Hope you having a fantastic Tuesday welcome back to the Philip Defranco show and let's just jump into it the first thing

00:00:07.200 --> 00:00:11.519
I want to talk about today one of the most requested stories of the day today is an update on the

00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday

00:00:13.650 --> 00:00:19.350
Specifically we're talking about creator and panelist Anita Sarkeesian being on a panel calling someone in the crowd

...

...
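If you need the captions as data rather than as a file, the cues can be pulled out with a short parser. This is a minimal sketch, not a full WebVTT implementation: it assumes timestamps in the hh:mm:ss.mmm form shown above, and the function name `parse_vtt` is hypothetical.

```python
import re

# Matches cue timing lines such as "00:00:01.500 --> 00:00:07.200".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")

def parse_vtt(text):
    """Return a list of (start, end, caption) tuples from WebVTT text."""
    cues = []
    # Blocks (header or cue) are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                # Everything after the timing line is the caption text.
                caption = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((match.group(1), match.group(2), caption))
                break
    return cues

sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.500
You beautiful bastards

00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday
"""

for start, end, caption in parse_vtt(sample):
    print(start, end, caption)
```

For a downloaded file, read its contents with `open(path, encoding="utf-8").read()` and pass the result to `parse_vtt`.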
Jebby answered Oct 11 '22 01:10
