
Getting attribute's value using BeautifulSoup

I'm writing a Python script that extracts script locations from a parsed webpage. Let's say there are two scenarios:

<script type="text/javascript" src="http://example.com/something.js"></script>

and

<script>some JS</script>

I'm able to get the JS in the second scenario, that is, when the JS is written inline between the tags.

But is there any way to get the value of src in the first scenario (i.e., to extract the src attribute value of every script tag, such as http://example.com/something.js)?

Here's my code:

#!/usr/bin/python

import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    print(n)

Output: Some JS

Output needed: http://example.com/something.js

aditya.gupta asked Sep 11 '13 05:09



2 Answers

This gets the src value of every <script> tag that has one and skips the tags that don't:

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# {"src": True} matches only <script> tags that have a src attribute
sources = soup.find_all("script", {"src": True})
for source in sources:
    print(source["src"])

I get the following two src values as a result:

http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js

I guess this is what you want. Hope this is useful.
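
If you want to check the filtering without hitting a live site, here is a minimal offline sketch. It assumes Python 3 and the built-in "html.parser" backend; SAMPLE_HTML and script_sources are hypothetical names made up for illustration, with the markup copied from the two scenarios in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup built from the two scenarios in the question.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript" src="http://example.com/something.js"></script>
<script>some JS</script>
</head><body></body></html>
"""


def script_sources(html):
    """Return the src value of every <script> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True matches only tags on which the attribute is present,
    # so the inline <script> is skipped.
    return [tag["src"] for tag in soup.find_all("script", src=True)]


print(script_sources(SAMPLE_HTML))
```

Passing src=True as a keyword argument should behave the same as the {"src": True} dictionary form used above.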

Venkateshwaran Selvaraj answered Nov 07 '22 22:11


Get 'src' from each script node with get(), which returns None for tags that have no src attribute:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    print("src:", n.get('src'))  # <==== src is read here
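
Because this loop visits every script tag, inline scripts show up with a src of None, whereas indexing with n['src'] would raise KeyError on them. A rough sketch of separating the two cases, assuming Python 3 and the "html.parser" backend; SAMPLE_HTML and split_scripts are hypothetical names used only for this example:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one external script and one inline script.
SAMPLE_HTML = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    "<script>some JS</script>"
)


def split_scripts(html):
    """Return (external src values, inline script bodies) for all <script> tags."""
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        src = tag.get("src")  # None when the tag has no src attribute
        if src is not None:
            external.append(src)
        else:
            inline.append(tag.string)  # the JS written between the tags
    return external, inline


external, inline = split_scripts(SAMPLE_HTML)
print("external:", external)
print("inline:", inline)
```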

rajpy answered Nov 07 '22 23:11