
Extract script contents with BeautifulSoup (4.9.0)

Starting with version 4.9.0, BeautifulSoup4 changed[0] how the text property works: it now ignores the contents of embedded scripts:

= 4.9.0 (20200405)
...
* Embedded CSS and Javascript is now stored in distinct Stylesheet and
  Script tags, which are ignored by methods like get_text() since most
  people don't consider this sort of content to be 'text'. This
  feature is not supported by the html5lib treebuilder. [bug=1868861]

So it is no longer possible to extract the wanted text from HTML such as <script>wanted text</script> using soup.find('script').text.

What is the preferred way of extracting it now? I'd rather not strip <script> and </script> from str(script) by hand.

[0] - https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG

Kyryl Havrylenko asked Oct 15 '22 04:10



1 Answer

You could try using the script tag's contents as follows:

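import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the page you want to parse.
r = requests.get("https://www.yourwebsite.com")
soup = BeautifulSoup(r.content, "html.parser")

for script in soup.find_all('script'):
    # Skip empty tags such as <script src="..."></script>.
    if script.contents:
        # The first child holds the embedded script source.
        print(script.contents[0])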
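
A related option, not part of the original answer, is the tag's .string attribute, which returns the tag's only child string, or None when the tag is empty. The sketch below is minimal and assumes the html.parser treebuilder; the inline html value is a made-up sample, used so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used here instead of a live request.
html = "<html><body><script>wanted text</script><p>visible</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

script = soup.find("script")
# .string is None for an empty tag, so guard against a missing tag as well.
script_text = script.string if script is not None else None
print(script_text)  # prints: wanted text
```

Unlike contents[0], .string does not raise IndexError on an empty tag, but the None case still has to be handled before the value is used.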
Martin Evans answered Oct 31 '22 02:10
