I'm writing a Python script that parses a webpage and extracts the locations of its scripts. Let's say there are two scenarios:
<script type="text/javascript" src="http://example.com/something.js"></script>
and
<script>some JS</script>
I'm able to get the JS in the second scenario, that is, when the JS is written inline between the tags.
But is there any way I could get the value of src in the first scenario (i.e. extract the values of all src attributes on script tags, such as http://example.com/something.js)?
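Here's my code:
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
# Name the parser explicitly so results do not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    # Prints each whole <script> tag, including any inline JS
    print(n)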
Output: some JS
Output needed: http://example.com/something.js
To extract an attribute of an element in Beautiful Soup, use square-bracket notation on the tag. For instance, el["id"] retrieves the value of the id attribute.
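As a minimal sketch of the square-bracket access described above, the following parses a small hypothetical HTML string (not the page from the question) and reads the src attribute. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = (
    '<script id="main" src="http://example.com/something.js"></script>'
    '<script>var x = 1;</script>'
)
soup = BeautifulSoup(html, "html.parser")

external, inline = soup.find_all("script")

# Square-bracket access returns the attribute value as a string.
src_value = external["src"]
print(src_value)

# The same access on a tag without that attribute raises KeyError,
# so has_attr() can be used to check first.
inline_has_src = inline.has_attr("src")
print(inline_has_src)
```

Since square brackets raise KeyError for a missing attribute, this form fits best when the attribute is known to be present.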
Create an HTML document containing <p> tags and pass it to the BeautifulSoup() constructor. Use find_all('p') on the resulting BeautifulSoup object to extract the paragraphs, and call get_text() on each one to get its text.
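The paragraph steps above might look like this in practice. The HTML string is a made-up example, and html.parser is chosen here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two paragraphs.
html = "<html><body><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# get_text() joins the text of a tag and all of its descendants.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]
print(paragraph_texts)
```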
Beautiful Soup provides the find() and find_all() methods for retrieving specific elements from an HTML document by tag name. find() returns the first element with the given tag, or None if there is none. find_all() returns all elements with the given tag.
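A short sketch of the difference between the two methods, using a hypothetical snippet with two script tags and no video tags:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = '<script src="a.js"></script><script src="b.js"></script>'
soup = BeautifulSoup(html, "html.parser")

first_script = soup.find("script")     # a single Tag
all_scripts = soup.find_all("script")  # a list-like collection of Tags

missing = soup.find("video")           # None when nothing matches
no_videos = soup.find_all("video")     # empty collection when nothing matches

print(first_script["src"], len(all_scripts), missing, len(no_videos))
```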
To extract elements by id in Beautiful Soup, either call find_all() with the id argument, e.g. find_all(id="main"), or call select() with a CSS selector such as "#main".
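Both id lookups can be sketched as follows. The ids "header" and "content" are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = '<div id="header">Top</div><div id="content">Body</div>'
soup = BeautifulSoup(html, "html.parser")

# Keyword-argument filter on the id attribute.
by_find_all = soup.find_all(id="content")

# CSS selector: "#content" matches the element whose id is "content".
by_select = soup.select("#content")

print(by_find_all[0].get_text(), by_select[0].get_text())
```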
This gets the src value of every <script> tag that has one; tags without a src attribute are skipped.
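from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# {"src": True} matches only <script> tags that have a src attribute
sources = soup.find_all('script', {"src": True})
for source in sources:
    print(source['src'])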
I get the following two src values as a result:
http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js
I guess this is what you want. Hope this is useful.
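For comparison, here is an offline sketch that runs the same src filter against the two scenarios from the question, next to a CSS attribute selector. The selector version is an alternative technique, not part of the answer above, and the third script tag is a made-up addition.

```python
from bs4 import BeautifulSoup

# The two scenarios from the question, plus one hypothetical relative URL.
html = """
<script type="text/javascript" src="http://example.com/something.js"></script>
<script>some JS</script>
<script src="/static/app.js"></script>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute filter: src=True keeps only tags that have a src attribute.
with_filter = [tag["src"] for tag in soup.find_all("script", src=True)]

# CSS attribute selector: script[src] selects the same set of tags.
with_selector = [tag["src"] for tag in soup.select("script[src]")]

print(with_filter)
print(with_selector)
```

Either form skips the inline script, so indexing with ["src"] is safe inside the loop.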
Get the src attribute from each script node with get(), which returns None for tags that have no src.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    # get() returns None when the tag has no src attribute
    print("src:", n.get('src'))
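Because get('src') yields None for inline scripts, the loop can also separate the two scenarios from the question. The following is a sketch on a hypothetical HTML string; the list names are illustrative.

```python
from bs4 import BeautifulSoup

# The two scenarios from the question as a hypothetical test document.
html = """
<script type="text/javascript" src="http://example.com/something.js"></script>
<script>some JS</script>
"""
soup = BeautifulSoup(html, "html.parser")

external_sources = []  # values of src attributes
inline_snippets = []   # JS written directly between the tags

for script in soup.find_all("script"):
    src = script.get("src")  # None when the attribute is absent
    if src is not None:
        external_sources.append(src)
    else:
        inline_snippets.append(script.string)

print(external_sources)
print(inline_snippets)
```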