Extract content of <script> with BeautifulSoup

Tags: python, beautifulsoup, python-2.7

1/ I am trying to extract part of a script's contents using Beautiful Soup, but it prints nothing. What's wrong?

    URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
    oururl = urllib2.urlopen(URL).read()
    soup = BeautifulSoup(oururl)

    for script in soup("script"):
        script.extract()

    list_of_scripts = soup.findAll("script")
    print list_of_scripts

2/ The goal is to extract the value of the attribute "transcript":

    <script type="application/ld+json">
    {
        "@context": "http://schema.org",
        "@type": "VideoObject",
        "video": {
            "@type": "VideoObject",
            "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
            "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
            "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.
                  SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:
                  \"Immediately my whole mouth was on fire.\"
                  The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.
                  SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:
                  \"It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee.\"
                  Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.
                  The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.
                  SOUNDBITE: JIM HARDING, HUSBAND, SAYING:
                  \"Bottom line, so no one ever has to go through this again.\"
                  The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",

783

asked Oct 04 '14 12:10

laiho b

2 Answers

From the documentation:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

So the accepted answer from falsetru is fine, but with newer versions of Beautiful Soup use .string instead of .text, or you'll be puzzled, as I was, by .text always coming back empty for <script> tags.
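A minimal sketch of the .string approach, written for Python 3 and run against an inline snippet instead of the live page. The HTML below is a shortened, hypothetical stand-in for the real document, and html.parser is chosen only so that no extra parser needs to be installed.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page's JSON-LD block.
snippet = (
    '<script type="application/ld+json">'
    '{"video": {"transcript": "Immediately my whole mouth was on fire."}}'
    "</script>"
)

soup = BeautifulSoup(snippet, "html.parser")
tag = soup.find("script", type="application/ld+json")

# .string returns the tag's single child string, i.e. the raw script body.
raw = tag.string
data = json.loads(raw)
transcript = data["video"]["transcript"]
print(transcript)
```

Since .string hands back the tag's only child string directly, it does not depend on what a given Beautiful Soup version counts as "text".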

111

answered Sep 19 '22 22:09

Andrew Richards

extract() removes the tag from the DOM. That's why you get an empty list.
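To see the effect in isolation, here is a small sketch (Python 3, with a made-up two-script document) that counts the <script> tags before and after calling extract(). The removed tags are still usable through the return value of extract().

```python
import json

from bs4 import BeautifulSoup

# Made-up document with two <script> tags, for illustration only.
page = """
<html><head>
<script type="application/ld+json">{"video": {"transcript": "hello"}}</script>
<script>var x = 1;</script>
</head><body><p>text</p></body></html>
"""

soup = BeautifulSoup(page, "html.parser")
before = len(soup.find_all("script"))

# extract() detaches each tag from the tree and returns the detached tag.
removed = [script.extract() for script in soup("script")]

after = len(soup.find_all("script"))
print(before, after)  # 2 0

# The detached tags keep their contents.
first = json.loads(removed[0].string)
print(first["video"]["transcript"])  # hello
```

So the loop in the question strips every <script> from the soup before findAll ever runs. Dropping the loop (or working with the values extract() returns) keeps the data reachable.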

Find the script with the type="application/ld+json" attribute and decode its contents using json.loads. You can then access the data as a Python data structure (a dict for the given data).

    import json
    import urllib2

    from bs4 import BeautifulSoup

    URL = ("http://www.reuters.com/video/2014/08/30/"
           "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
    oururl = urllib2.urlopen(URL).read()
    soup = BeautifulSoup(oururl)

    # Locate the JSON-LD <script> and parse its contents as JSON.
    data = json.loads(soup.find('script', type='application/ld+json').text)
    print data['video']['transcript']
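For readers on Python 3, the same idea can be sketched roughly as follows. This is not the original answer's code: extract_field is a hypothetical helper, PAGE is a shortened inline stand-in for the downloaded page, and the sketch assumes the transcript may contain raw line breaks (hence strict=False) and HTML entities such as &#039; (hence html.unescape).

```python
import html
import json

from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the downloaded page.
PAGE = """
<html><head>
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
        "transcript": "Jan Harding is speaking out.
            SOUNDBITE: \\"Immediately my whole mouth was on fire.\\""
    }
}
</script>
</head><body></body></html>
"""


def extract_field(page, field):
    """Return one field of the JSON-LD "video" object, or None if absent."""
    soup = BeautifulSoup(page, "html.parser")
    tag = soup.find("script", type="application/ld+json")
    if tag is None or tag.string is None:
        return None
    # strict=False lets json accept raw newlines inside string values.
    data = json.loads(tag.string, strict=False)
    value = data.get("video", {}).get(field)
    if value is None:
        return None
    # Collapse runs of whitespace, then decode entities like &#039;.
    return html.unescape(" ".join(value.split()))


transcript = extract_field(PAGE, "transcript")
headline = extract_field(PAGE, "headline")
print(transcript)
print(headline)
```

To run this against the live page, the Python 3 counterpart of urllib2.urlopen is urllib.request.urlopen. Whether the URL from the question still serves the same markup is not something this sketch checks.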

answered Sep 21 '22 22:09

falsetru
