This is the HTML I'm parsing. It all sits inside a table and the same pattern is repeated many times. I just want the href attribute value inside the div with class="Special_Div_Name". Each of these divs is inside a table row, and there are lots of rows.
<tr>
<div class="Special_Div_Name">
<a href="something.mp3">text</a>
</div>
</tr>
What I want is only the href attribute values ending in ".mp3" that are inside the div with class="Special_Div_Name".
So far I was able to come up with this code:
download = soup.find_all('a', href = re.compile('.mp3'))
for text in download:
    hrefText = (text['href'])
    print(hrefText)
This code currently prints every href attribute value on the page that ends in ".mp3", which is very close to what I want. It's just that I only want the ".mp3" links that are inside that div class.
If you are using Selenium rather than Beautiful Soup, you can fetch all anchor (<a>) elements with find_elements_by_tag_name() (in Selenium 4, find_elements(By.TAG_NAME, 'a')). It returns a list of the elements with the tag name given as the argument. If no element on the page matches, an empty list is returned.
To inspect the anchor elements by hand, open the page source in the browser with Ctrl+U. You can then search the source for the anchor text, or copy it into a text editor and search there.
This minor adjustment should get you what you want:
special_divs = soup.find_all('div', {'class': 'Special_Div_Name'})
for div in special_divs:
    # search only inside the current div
    download = div.find_all('a', href=re.compile(r'\.mp3$'))
    for link in download:
        hrefText = link['href']
        print(hrefText)
Since Beautiful Soup accepts most CSS selectors with the .select() method, I'd suggest using the attribute selector [href$=".mp3"] to select a elements with an href attribute ending in .mp3.
Then you can just prepend the selector .Special_Div_Name so that only anchor elements that are descendants of that div are selected:
for a in soup.select('div.Special_Div_Name a[href$=".mp3"]'):
    print(a['href'])
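To see the selector working end to end, here is a minimal self-contained sketch. The sample markup, the Other_Div class and the file names are made up for illustration, and it assumes the built-in html.parser backend (other parsers may restructure invalid table markup differently):

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: only the first link is an .mp3 inside the target div.
html = """
<table>
<tr><td><div class="Special_Div_Name"><a href="something.mp3">text</a></div></td></tr>
<tr><td><div class="Other_Div"><a href="elsewhere.mp3">text</a></div></td></tr>
<tr><td><div class="Special_Div_Name"><a href="notes.mp3.txt">text</a></div></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant anchors of div.Special_Div_Name whose href ends with ".mp3"
hrefs = [a["href"] for a in soup.select('div.Special_Div_Name a[href$=".mp3"]')]
print(hrefs)  # ['something.mp3']
```

The link in Other_Div is skipped because of the class filter, and notes.mp3.txt is skipped because $= only matches at the end of the attribute value.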
In a more general case, if you would just like to select a elements with an [href] attribute that are descendants of a div element, then you would use the selector div a[href]:
for a in soup.select('div a[href]'):
    print(a)
If you don't use the code above and want to stay close to the original code that you provided, you would need to select all the elements with a class of Special_Div_Name, then iterate over those elements and select the descendant anchor elements:
for div in soup.select('.Special_Div_Name'):
    for a in div.find_all('a', href=re.compile(r'\.mp3$')):
        print(a['href'])
As a side note, re.compile('.mp3') should be re.compile(r'\.mp3$'), since an unescaped . has special meaning in a regular expression (it matches any character). In addition, you will also want the anchor $ in order to match at the end of the string (rather than anywhere in the string).
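The difference between the two patterns can be checked with the standard re module alone. This is a small sketch with made-up sample strings, and it uses search() on the assumption that the pattern may match anywhere in the href value unless it is anchored:

```python
import re

loose = re.compile('.mp3')       # "." matches any character, no end anchor
strict = re.compile(r'\.mp3$')   # literal dot, anchored at the end of the string

# Hypothetical href values for illustration
samples = ['song.mp3', 'song.mp3.txt', 'xmp3-info.html']

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['song.mp3', 'song.mp3.txt', 'xmp3-info.html']
print(strict_hits)  # ['song.mp3']
```

The loose pattern accepts 'xmp3-info.html' because the dot matches the "x", and it accepts 'song.mp3.txt' because nothing ties the match to the end of the string.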