BeautifulSoup to extract all external resources from HTML

Question

I am looking to identify the URLs that request external resources in HTML files.

I currently use the src attribute of the img and script tags, and the href attribute of the link tag (to identify CSS).

Are there other tags that I should be examining to identify other resources?

For reference, my code in Python is currently:

from bs4 import BeautifulSoup

html = read_in_file(file)  # read_in_file: helper (not shown) that returns the file's HTML
soup = BeautifulSoup(html, "html.parser")
image_src = [x['src'] for x in soup.find_all('img')]
css_link = [x['href'] for x in soup.find_all('link')]
script_src = []   # script tags often have no 'src' attribute, hence the try/except
for x in soup.find_all('script'):
    try:
        script_src.append(x['src'])
    except KeyError:
        pass

kyrenia · Accepted Answer

I updated my code to capture what seem to be the most common resources in HTML. Note that this does not look at resources requested from within CSS or JavaScript. If I am missing any tags, please comment.

from bs4 import BeautifulSoup

def find_list_resources(tag, attribute, soup):
    """Collect the given attribute from every matching tag that has it."""
    resources = []
    for x in soup.find_all(tag):
        try:
            resources.append(x[attribute])
        except KeyError:
            # Tag lacks the attribute (e.g. an inline <script>); skip it.
            pass
    return resources

html = read_in_file(file)  # read_in_file: helper (not shown) that returns the file's HTML
soup = BeautifulSoup(html, "html.parser")

image_src = find_list_resources("img", "src", soup)
script_src = find_list_resources("script", "src", soup)
css_link = find_list_resources("link", "href", soup)
video_src = find_list_resources("video", "src", soup)
audio_src = find_list_resources("audio", "src", soup)
iframe_src = find_list_resources("iframe", "src", soup)
embed_src = find_list_resources("embed", "src", soup)
object_data = find_list_resources("object", "data", soup)
source_src = find_list_resources("source", "src", soup)
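
The CSS gap mentioned above could be narrowed with something like the following. This is only a rough sketch, not part of the original answer: the function name find_css_urls, the regular expression, and the example.com URLs are all illustrative assumptions, and a regex will not handle every CSS edge case (escaped characters, @import with a bare string, and so on). It pulls url(...) references out of a string of CSS using only the standard library and resolves them against a base URL.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper pattern: matches url(...) in CSS text, quoted or unquoted.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)

def find_css_urls(css_text, base_url=""):
    """Return URLs referenced via url(...) in css_text, resolved against base_url."""
    urls = []
    for match in CSS_URL_RE.finditer(css_text):
        url = match.group(2).strip()
        # Skip empty references and inline data: URIs, which trigger no request.
        if url and not url.lower().startswith("data:"):
            urls.append(urljoin(base_url, url))
    return urls

# Made-up stylesheet text used purely for illustration.
css = """
body { background: url("img/bg.png"); }
@font-face { src: url(https://cdn.example.com/font.woff2); }
.icon { background: url('data:image/png;base64,AAAA'); }
"""
print(find_css_urls(css, "https://example.com/static/site.css"))
```

The text of inline style elements and style attributes taken from the soup could be passed in as css_text; linked stylesheets would have to be downloaded first, and resources requested from JavaScript are still not covered.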

Tags: python, html, beautifulsoup
