I'm trying to scrape a YouTube page for the subtitles. Unfortunately the response doesn't include everything the page loads in a browser. I'm curious to know where I went wrong.
Query string:
https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=Nxb2s2Mv6Pw&lang=en&bl=vmp&forceedit=captions&tab=captions
So I've figured out that Nxb2s2Mv6Pw is the unique video ID, and I can just substitute it accordingly.
If I run the code below, it doesn't find the <textarea yt-uix-form-input-textarea ...> tag that I need it to locate.
I'm desperately trying to avoid using Selenium for this, as I've got a lot of links I need to iterate through and repeat the process on. As you can tell from the code below, I've tried adding a delay to wait for the page to load, but no luck.
import time
import requests
from bs4 import BeautifulSoup

channel = 'https://www.youtube.com/timedtext_editor?action_mde_edit_form=1&v=dto4koj5DTA&lang=en'
s = requests.Session()
# s.headers['User-Agent'] = USER_AGENT
r = s.get(channel)
time.sleep(5)
html = r.text
soup = BeautifulSoup(html, 'lxml')
for i in soup.find_all('div'):
    print(i)
Please advise.
I tried scraping the page using requests and lxml, but when iterating over the tags, I could find no subtitles on the page (the textarea tag that holds the subtitles does not show up in the parsed HTML).
This is likely because YouTube uses JavaScript to load the subtitles, and Python's requests library does not execute JavaScript. You do, however, have a few options:
Use Selenium to scrape the subtitles (you said you would rather not do this).
Look at the POST and GET requests in your browser's developer tools and try sending the needed parameters to the URL you traced the JavaScript to (this may not always work if authentication or dynamic tokens are used for the parameters).
Use youtube-dl (this seems to be the easiest / most reliable way to go about this). youtube-dl is a command-line utility, but you can also import it in Python, according to the docs on GitHub.
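For the traced-request option, a minimal standard-library sketch of building the URL to replay. The endpoint and parameter names below are placeholders for illustration — substitute whatever your browser's network tab actually shows for the subtitle request:

```python
from urllib.parse import urlencode

def traced_subtitle_url(video_id, lang="en"):
    # Example endpoint -- verify the real one in your browser's devtools
    base = "https://www.youtube.com/api/timedtext"
    return base + "?" + urlencode({"v": video_id, "lang": lang})

print(traced_subtitle_url("Nxb2s2Mv6Pw"))
# -> https://www.youtube.com/api/timedtext?v=Nxb2s2Mv6Pw&lang=en
# Fetch it with requests.get(url), then parse whatever payload comes back.
```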
There are a couple of ways you could go about this; I will use the video you pointed to in your post for my examples. From the command line:
youtube-dl --write-sub --skip-download --sub-lang en https://www.youtube.com/watch?v=Nxb2s2Mv6Pw
With that being said, you can make a function in Python to call the command:
import subprocess

def download_subs(video_url, lang="en"):
    cmd = [
        "youtube-dl",
        "--skip-download",
        "--write-sub",
        "--sub-lang",
        lang,
        video_url,
    ]
    # Passing the argument list directly means the ? and & in YouTube
    # URLs need no shell quoting
    subprocess.run(cmd)
url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
download_subs(url)
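Since you mentioned having a lot of links to iterate through, a small sketch of a batch run: build a watch URL from each video ID (the ID list here is a made-up example) and hand each one to the function above:

```python
video_ids = ["Nxb2s2Mv6Pw", "dto4koj5DTA"]  # your list of unique video IDs

urls = ["https://www.youtube.com/watch?v=" + vid for vid in video_ids]
for url in urls:
    print(url)
    # download_subs(url)  # uncomment once download_subs is defined
```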
Alternatively, you could import youtube_dl from Python directly and use it from there:
import youtube_dl

def download_subs(url, lang="en"):
    opts = {
        "skip_download": True,
        "writesubtitles": True,    # write the subtitle file
        "subtitleslangs": [lang],  # note: this option expects a list
    }
    with youtube_dl.YoutubeDL(opts) as yt:
        yt.download([url])

url = "https://www.youtube.com/watch?v=Nxb2s2Mv6Pw"
download_subs(url)
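One thing to be aware of: some videos only carry auto-generated captions, and youtube-dl uses a separate option for those. A sketch of the options dict in that case:

```python
opts = {
    "skip_download": True,
    "writeautomaticsub": True,   # fetch YouTube's auto-generated captions
    "subtitleslangs": ["en"],
}
# Use it the same way: youtube_dl.YoutubeDL(opts).download([url])
```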
This creates a file in the working directory named
CNN 'Exposed' In Controversial Secret Video and Anita Sarkeesian's 'Punishment'...-Nxb2s2Mv6Pw.en.vtt
WEBVTT
Kind: captions
Language: en
00:00:00.000 --> 00:00:01.500
You beautiful bastards
00:00:01.500 --> 00:00:07.200
Hope you having a fantastic Tuesday welcome back to the Philip Defranco show and let's just jump into it the first thing
00:00:07.200 --> 00:00:11.519
I want to talk about today one of the most requested stories of the day today is an update on the
00:00:11.889 --> 00:00:13.650
Craziness out of Vidcon yesterday
00:00:13.650 --> 00:00:19.350
Specifically we're talking about creator and panelist Anita Sarkeesian being on a panel calling someone in the crowd
...
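Once you have the .vtt file, pulling out just the text needs only the standard library. This is a naive sketch that assumes simple cues like the ones above (no styling or positioning metadata):

```python
def vtt_to_text(vtt):
    # Keep lines that are neither the WEBVTT header, metadata, nor timestamps
    lines = []
    for line in vtt.splitlines():
        line = line.strip()
        if not line or "-->" in line:
            continue
        if line == "WEBVTT" or line.startswith(("Kind:", "Language:")):
            continue
        lines.append(line)
    return " ".join(lines)

sample = """WEBVTT
Kind: captions
Language: en

00:00:00.000 --> 00:00:01.500
You beautiful bastards
"""
print(vtt_to_text(sample))  # -> You beautiful bastards
```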