Can't scrape YouTube video's closed captions

Question

I'm trying to scrape a YouTube page for the subtitles. Unfortunately it isn't loading everything upon the request. I'm curious to know where I went wrong.

Query string:

https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=Nxb2s2Mv6Pw&lang=en&bl=vmp&forceedit=captions&tab=captions

So I've figured out that Nxb2s2Mv6Pw is the unique video ID in the URL, and I can just substitute it accordingly.

If I run the code below, it doesn't find the <textarea yt-uix-form-input-textarea ...> tag that I need to locate.

I'm desperately trying to avoid using Selenium to capture this, as I've got a lot of links I need to iterate through and repeat the process. As you can tell by the code below, I've tried to incorporate a delayed time to wait for the page to load, but nothing.

import os
import codecs
import sys
import time
import requests
from bs4 import BeautifulSoup

channel = 'https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=dto4koj5DTA&lang=en'
s = requests.Session()
time.sleep(5)
# s.headers['User-Agent'] = USER_AGENT
r = s.get(channel)
time.sleep(5)
html = r.text
soup = BeautifulSoup(html, 'lxml')

for i in soup.find_all('div'):
    print(i)

Please advise.

Jebby · Accepted Answer

I tried scraping the page using requests and lxml, but when iterating over the tags I could find no subtitles on the page (the textarea tag that holds the subtitles does not show up in the response). This is likely because YouTube uses JavaScript to load the subtitles.

Python's requests library does not execute JavaScript. You do, however, have a few options:
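To confirm that a tag is really absent from the raw response, rather than just hard to spot in the printed output, a quick check along these lines might help. This is a minimal sketch using only the standard library; TagCounter and count_tag are names made up for illustration, and the two HTML strings are hypothetical stand-ins, not real YouTube responses.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given tag opens in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag="textarea"):
    """Return the number of opening `tag` elements found in `html`."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins: what the server sends vs. what a browser ends up with.
static_html = "<html><body><div id='editor'></div><script>load()</script></body></html>"
rendered_html = "<html><body><div id='editor'><textarea>Hello</textarea></div></body></html>"

print(count_tag(static_html))    # 0
print(count_tag(rendered_html))  # 1
```

In practice you would pass r.text from the question's script to count_tag; a result of 0 suggests the element is added later by JavaScript.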

Use Selenium to scrape the subtitles. (You said you would rather not do this.)
Look into the POST and GET requests in your browser and try sending the required request parameters to the URL you traced the JavaScript to. (This may not always work if authentication or dynamic tokens are used for the parameters.)
Use youtube-dl to download the subtitles. (This seems to be the easiest and most reliable way to go about this.)

youtube-dl is a command-line utility, but you can also import it as a Python module, according to the docs on GitHub.

There are a couple of ways you could go about this. I will be using the video you pointed to in your post for my examples:

youtube-dl --write-sub --skip-download --sub-lang en "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"

With that being said, you can make a function in Python to call the command:
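Since the links in the question are timedtext_editor URLs while the command above takes a watch URL, a small helper could pull out the video ID and rebuild the link. This is only a sketch: video_id_from_url and watch_url are hypothetical names, and it assumes the ID always sits in the v query parameter, as it does in the question's URLs.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


def watch_url(video_id):
    """Build a regular watch URL from a video ID."""
    return "https://www.youtube.com/watch?v=" + video_id


editor_url = (
    "https://www.youtube.com/timedtext_editor"
    "?action_mde_edit_form=1&v=Nxb2s2Mv6Pw&lang=en"
)
print(watch_url(video_id_from_url(editor_url)))
# https://www.youtube.com/watch?v=Nxb2s2Mv6Pw
```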

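import subprocess

def download_subs(video_url, lang="en"):
    cmd = [
        "youtube-dl",
        "--skip-download",
        "--write-sub",
        "--sub-lang",
        lang,
        video_url
    ]

    # Pass the arguments as a list so the shell never interprets "?" or "&" in the URL.
    # check=True raises CalledProcessError if youtube-dl exits with an error.
    subprocess.run(cmd, check=True)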


url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"

download_subs(url)
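Because the question mentions a lot of links, the call could be wrapped in a loop that keeps going when one video fails. The sketch below is hedged accordingly: download_all is a hypothetical helper, it assumes the download function raises an exception on failure, and fake_download is a stand-in so the example runs without network access.

```python
def download_all(urls, download, lang="en"):
    """Call download(url, lang) for each URL and return the URLs that failed."""
    failed = []
    for url in urls:
        try:
            download(url, lang)
        except Exception as exc:
            # Record the failure and carry on with the remaining links.
            print("failed:", url, exc)
            failed.append(url)
    return failed


def fake_download(url, lang):
    # Stand-in for a real downloader; pretends URLs containing "bad" have no subtitles.
    if "bad" in url:
        raise RuntimeError("no subtitles")


failed = download_all(
    ["https://www.youtube.com/watch?v=Nxb2s2Mv6Pw", "https://example.com/bad"],
    fake_download,
)
print(failed)
```

With a real downloader you would pass that function in place of fake_download and retry or log whatever ends up in the returned list.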

Alternatively, you could import youtube_dl in Python directly and use it from there:

import youtube_dl

def download_subs(url, lang="en"):
    opts = {
        "skip_download": True,
        "writesubtitles": True,
        "subtitleslangs": [lang]
    }

    with youtube_dl.YoutubeDL(opts) as yt:
        yt.download([url])

url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
download_subs(url)

This creates a file in the working directory named

CNN 'Exposed' In Controversial Secret Video and Anita Sarkeesian's 'Punishment'...-Nxb2s2Mv6Pw.en.vtt

The contents of the file look something like this:

WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.500
You beautiful bastards

00:00:01.500 --> 00:00:07.200
Hope you having a fantastic Tuesday welcome back to the Philip Defranco show and let's just jump into it the first thing

00:00:07.200 --> 00:00:11.519
I want to talk about today one of the most requested stories of the day today is an update on the

00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday

00:00:13.650 --> 00:00:19.350
Specifically we're talking about creator and panelist Anita Sarkeesian being on a panel calling someone in the crowd

...

...
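If you need the caption text itself rather than the file, the .vtt output can be parsed with a few lines of Python. The following is a minimal sketch, not a full WebVTT parser: parse_vtt is a hypothetical helper, it assumes timestamps in the hh:mm:ss.mmm form shown above, and the sample string is a shortened copy of the output above.

```python
import re

# Matches cue timing lines such as "00:00:01.500 --> 00:00:07.200".
TIMING = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})")


def parse_vtt(text):
    """Return a list of (start, end, caption) tuples from WebVTT text."""
    cues = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = TIMING.match(stripped)
        if match:
            # A timing line starts a new cue.
            current = [match.group(1), match.group(2), []]
            cues.append(current)
        elif not stripped:
            # A blank line ends the current cue.
            current = None
        elif current is not None:
            # Any other line inside a cue is caption text.
            current[2].append(stripped)
        # Lines before the first cue (WEBVTT, Kind, Language) are skipped.
    return [(start, end, " ".join(lines)) for start, end, lines in cues]


sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.500
You beautiful bastards

00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday
"""

for start, end, caption in parse_vtt(sample):
    print(start, end, caption)
```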

Tags: python, beautifulsoup, python-requests, web-scraping

M4cJunk13
