How to find all anchor tags inside a div using Beautifulsoup in Python

This is the HTML I'm parsing. It sits inside a table and repeats many times, and I only want the href attribute value inside the div with the attribute class="Special_Div_Name". Each of these divs is inside a table row, and there are many rows.

<tr>
   <div class="Special_Div_Name">
      <a href="something.mp3">text</a>
   </div>
</tr>

What I want is only the href attribute values that end in ".mp3" that are inside the div with the attribute class="Special_Div_Name".

So far I was able to come up with this code:

download = soup.find_all('a', href = re.compile('.mp3'))
for text in download:
    hrefText = text['href']
    print(hrefText)

This code currently prints every href attribute value on the page that ends in ".mp3", which is very close to what I want. It's just that I only want the ".mp3" links that are inside that div class.

ddschmitz asked Feb 18 '16 01:02


2 Answers

This minor adjustment should get you what you want:

import re

# Narrow the search to the divs of interest first, then look for .mp3 links inside each one.
special_divs = soup.find_all('div', {'class': 'Special_Div_Name'})
for div in special_divs:
    downloads = div.find_all('a', href=re.compile(r'\.mp3$'))
    for link in downloads:
        print(link['href'])
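
For anyone who wants to try this end to end, here is a minimal sketch of the same find_all approach wrapped in a function. The function name mp3_links_in_div, the sample markup, and the outside.mp3 link are made up for illustration; only the Special_Div_Name class and something.mp3 come from the question. It assumes Beautiful Soup 4 with the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

# Literal ".mp3" at the very end of the href value.
MP3_HREF = re.compile(r'\.mp3$')

def mp3_links_in_div(soup, class_name):
    """Collect hrefs ending in .mp3 from anchors inside divs with the given class.

    Hypothetical helper, not part of Beautiful Soup.
    """
    links = []
    for div in soup.find_all('div', class_=class_name):
        for a in div.find_all('a', href=MP3_HREF):
            links.append(a['href'])
    return links

# Sample markup: one link inside the target div, one outside it.
html = (
    '<tr><div class="Special_Div_Name">'
    '<a href="something.mp3">text</a>'
    '</div></tr>'
    '<a href="outside.mp3">elsewhere</a>'
)
soup = BeautifulSoup(html, 'html.parser')
print(mp3_links_in_div(soup, 'Special_Div_Name'))  # ['something.mp3']
```

Note that if divs with the same class were nested inside each other, this loop would report the inner links more than once.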
rofls answered Oct 29 '22 15:10


Since Beautiful Soup supports most CSS selectors through the .select() method, I'd suggest using the attribute selector [href$=".mp3"] to select anchor elements whose href attribute ends with .mp3.

Then prepend the selector div.Special_Div_Name so that only anchor elements that are descendants of that div are selected:

for a in soup.select('div.Special_Div_Name a[href$=".mp3"]'):
    print (a['href'])
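
As a rough, self-contained check of that selector, the sketch below runs it against sample markup modeled on the question. The Other_Div class and the notes.txt and elsewhere.mp3 links are invented here to show what gets filtered out; it assumes Beautiful Soup 4 with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Sample markup: only the first link is both inside the target div
# and has an href ending in ".mp3".
html = """
<tr>
   <div class="Special_Div_Name">
      <a href="something.mp3">text</a>
      <a href="notes.txt">notes</a>
   </div>
</tr>
<tr>
   <div class="Other_Div">
      <a href="elsewhere.mp3">other</a>
   </div>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (the space) plus an "ends with" attribute selector.
mp3_links = [a["href"] for a in soup.select('div.Special_Div_Name a[href$=".mp3"]')]
print(mp3_links)  # ['something.mp3']
```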

In a more general case, if you just want to select anchor elements with an href attribute that are descendants of a div element, use the selector div a[href]:

for a in soup.select('div a[href]'):
    print (a)

If you'd rather stay closer to the code you originally provided, select all the elements with a class of Special_Div_Name, then iterate over those elements and select the descendant anchor elements:

for div in soup.select('.Special_Div_Name'):
    for a in div.find_all('a', href=re.compile(r'\.mp3$')):
        print(a['href'])

As a side note, re.compile('.mp3') should be re.compile(r'\.mp3$'), since . has special meaning in a regular expression (it matches any character). In addition, you will want the anchor $ in order to match at the end of the string (rather than anywhere in the string).
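
To make the difference concrete, here is a small sketch comparing the two patterns using only the standard re module. The href values are made up for illustration. It uses search() because that is how Beautiful Soup applies a compiled pattern when filtering.

```python
import re

loose = re.compile('.mp3')       # any character followed by "mp3", anywhere in the string
strict = re.compile(r'\.mp3$')   # a literal dot followed by "mp3", at the end of the string

# Hypothetical href values.
hrefs = ['something.mp3', 'xmp3-notes.html', 'song.mp3?download=1']

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)   # ['something.mp3', 'xmp3-notes.html', 'song.mp3?download=1']
print(strict_hits)  # ['something.mp3']
```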

Josh Crozier answered Oct 29 '22 15:10
