 

Extract content of <script> with BeautifulSoup

1/ I am trying to extract part of a script using Beautiful Soup, but it prints nothing. What's wrong?

    URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
    oururl = urllib2.urlopen(URL).read()
    soup = BeautifulSoup(oururl)

    for script in soup("script"):
        script.extract()

    list_of_scripts = soup.findAll("script")
    print list_of_scripts

2/ The goal is to extract the value of the "transcript" key:

    <script type="application/ld+json">
    {
        "@context": "http://schema.org",
        "@type": "VideoObject",
        "video": {
            "@type": "VideoObject",
            "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
            "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
            "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"Immediately my whole mouth was on fire.\"               The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee.\"               Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.               The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.               SOUNDBITE: JIM HARDING, HUSBAND, SAYING:               \"Bottom line, so no one ever has to go through this again.\"               The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",
laiho b asked Oct 04 '14 12:10




2 Answers

From the documentation:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

So the accepted answer from falsetru is fine, but with newer versions of Beautiful Soup use .string instead of .text, or you'll be puzzled, as I was, by .text coming back empty for <script> tags.
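To make the difference concrete, here is a minimal sketch, assuming Python 3, the bundled html.parser backend, and a small made-up page standing in for the real one. It reads the script body through .string and prints the page-level get_text() for comparison; what get_text() shows depends on the installed Beautiful Soup version, so no output is claimed for it.

```python
import json

from bs4 import BeautifulSoup

# Made-up page for illustration; not the real Reuters markup.
html = (
    "<html><body>"
    '<script type="application/ld+json">'
    '{"video": {"transcript": "Hello"}}'
    "</script>"
    "<p>Visible text</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", type="application/ld+json")

# .string returns the single string child of the tag, i.e. the raw JSON.
raw = tag.string
data = json.loads(raw)

# Page-level text; on 4.9.0+ the quoted documentation says script
# contents are not treated as text here.
page_text = soup.get_text()

print(repr(page_text))
print(data["video"]["transcript"])
```

Note that .string only works this way because the tag has exactly one string child; for a tag with several children it returns None.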

Andrew Richards answered Sep 19 '22 22:09



extract() removes the tag from the DOM. That is why you get an empty list.
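A small sketch of that effect, assuming Python 3, the html.parser backend, and a made-up page in place of the real one: the script tags are counted before and after the extract() loop from the question.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration; not the real Reuters markup.
html = (
    "<html><body>"
    '<script type="application/ld+json">'
    '{"video": {"transcript": "Hello"}}'
    "</script>"
    "<p>Visible text</p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Look the tags up before anything is removed.
scripts_before = soup.find_all("script")

# extract() detaches each tag from the tree and returns the detached tag.
removed = [script.extract() for script in soup("script")]

# The tree no longer contains any <script> tags.
scripts_after = soup.find_all("script")

print(len(scripts_before), len(removed), len(scripts_after))  # prints: 1 1 0
```

The detached tags are still usable through the return value of extract(), so the data is not lost; it is just no longer reachable from soup.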


Instead, find the script tag with the type="application/ld+json" attribute and decode its contents using json.loads. You can then access the data as an ordinary Python data structure (a dict for the given data).

    import json
    import urllib2

    from bs4 import BeautifulSoup

    URL = ("http://www.reuters.com/video/2014/08/30/"
           "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
    oururl = urllib2.urlopen(URL).read()
    soup = BeautifulSoup(oururl)

    data = json.loads(soup.find('script', type='application/ld+json').text)
    print data['video']['transcript']
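The code above targets Python 2 (urllib2 and the print statement). Below is a rough Python 3 sketch of the same idea, not a drop-in replacement: the helper name transcript_from_html and the sample markup are made up for illustration, it reads the script body through .string, and it returns None when the tag or key is missing instead of raising. Fetching the page is left out; in Python 3 that would go through urllib.request.urlopen rather than urllib2.

```python
import json

from bs4 import BeautifulSoup


def transcript_from_html(html):
    """Return video.transcript from the first JSON-LD script, or None.

    Hypothetical helper; the key layout follows the snippet in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    # find() returns None when no matching tag exists.
    if tag is None or tag.string is None:
        return None
    data = json.loads(tag.string)
    return data.get("video", {}).get("transcript")


# Made-up markup shaped like the snippet in the question.
SAMPLE = """
<html><head>
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "transcript": "Hello"
    }
}
</script>
</head><body><p>Visible text</p></body></html>
"""

print(transcript_from_html(SAMPLE))
print(transcript_from_html("<p>no script here</p>"))
```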
falsetru answered Sep 21 '22 22:09
