Starting with version 4.9.0, BeautifulSoup4 changed[0] the way the .text
property works. It now ignores the contents of embedded scripts:
= 4.9.0 (20200405)
...
* Embedded CSS and Javascript is now stored in distinct Stylesheet and
Script tags, which are ignored by methods like get_text() since most
people don't consider this sort of content to be 'text'. This
feature is not supported by the html5lib treebuilder. [bug=1868861]
So now it's no longer possible to extract "wanted text" out of the HTML <script>wanted text</script> using soup.find('script').text.
What is the preferred way of extracting it now? I'd rather not strip <script> and </script> from str(script) by hand.
[0] - https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG
You could try using the script tag's contents as follows:
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with the built-in parser
r = requests.get("https://www.yourwebsite.com")
soup = BeautifulSoup(r.content, "html.parser")

for script in soup.find_all('script'):
    # Skip empty tags such as <script src="..."></script>
    if script.contents:
        print(script.contents[0])
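A variation on the same idea, as a minimal sketch: the tag's .string attribute returns its single child string (or None when the tag is empty), so it should not depend on how get_text() filters script content. The inline HTML and the helper name script_texts below are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for this illustration
html = (
    "<html><head>"
    "<script src='app.js'></script>"
    "<script>wanted text</script>"
    "</head><body><p>visible</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def script_texts(soup):
    """Return the inline source of every <script> tag that has exactly one string child."""
    # .string is None for empty tags, so external scripts (src=...) are skipped
    return [str(s.string) for s in soup.find_all("script") if s.string is not None]


for text in script_texts(soup):
    print(text)
```

If a script tag could hold more than one child node, joining script.contents would be the safer choice. With the parsers I have tried, an inline script ends up as a single string, but treat that as an assumption.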