Extracting anchor text from span class with BeautifulSoup

This is the HTML I am trying to scrape:

<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>, 
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>

I want to get the anchor text for each a href: cinematic, dissolve, epic, etc.

This is the code I have:

url = urllib2.urlopen("http://example.com")

content = url.read()
soup = BeautifulSoup(content)

links = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for link in links:
    print link.find_all('a')['href']

With link.find_all I get an error: TypeError: list indices must be integers, not str.

But if I use print link.find('a')['href'] instead, I only get the first one.

How can I get all of them?

Alex TheWebGroup asked Dec 19 '22 20:12


2 Answers

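find_all returns a list of tags, so you cannot index it with 'href' directly. Loop over the anchors and read the attribute from each one: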

from bs4 import BeautifulSoup

content = '''
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>, 
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
'''

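# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(content, "html.parser")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})
for span in spans:
    # find_all returns a list of tags, so iterate before reading 'href'
    links = span.find_all('a')
    for link in links:
        print(link['href'])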

Output

/tags/cinematic
/tags/dissolve
/tags/epic
/tags/fly
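
Since the question asks for the anchor text rather than the href, here is a minimal sketch of the same loop that collects the text instead. It assumes Python 3 and the built-in html.parser; the helper name anchor_texts is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

content = '''
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
'''


def anchor_texts(html, span_class="meta-attributes__attr-tags"):
    """Return the text of every <a> inside spans with the given class.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    # find_all returns a list-like ResultSet, so iterate over it rather than
    # indexing it with a string (that is what raised the TypeError).
    for span in soup.find_all("span", class_=span_class):
        for link in span.find_all("a"):
            texts.append(link.get_text(strip=True))
    return texts


tags = anchor_texts(content)
print(tags)  # ['cinematic', 'dissolve', 'epic', 'fly']
```

Swapping link.get_text(strip=True) for link['href'] gives back the paths printed above.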
gtlambert answered Dec 21 '22 08:12


from bs4 import BeautifulSoup

html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>, 
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
"""

soup = BeautifulSoup(html, "lxml")
spans = soup.find_all("span", {"class": "meta-attributes__attr-tags"})

for span in spans:
    for link in span.find_all('a'):
        # print the anchor text followed by its href
        print("{} {}".format(link.text, link['href']))

Another, more expensive, way is to collect every a tag in the document and skip those whose parent is not the target span:

from bs4 import BeautifulSoup

html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
"""

soup = BeautifulSoup(html, "lxml")
links = soup.find_all("a")

for link in links:
    # skip anchors whose direct parent is not the target span
    if 'meta-attributes__attr-tags' not in link.parent.get('class', []):
        continue

    print("{} {}".format(link.text, link['href']))
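
A CSS selector can express the same filter in a single call. This is only a sketch, assuming Python 3, the built-in html.parser, and a bs4 install where select() is available; the variable name pairs is illustrative.

```python
from bs4 import BeautifulSoup

html = """
<span class="meta-attributes__attr-tags">
<a href="/tags/cinematic" title="cinematic">cinematic</a>,
<a href="/tags/dissolve" title="dissolve">dissolve</a>,
<a href="/tags/epic" title="epic">epic</a>,
<a href="/tags/fly" title="fly">fly</a>,
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# "span.meta-attributes__attr-tags a" matches every <a> that is a descendant
# of a <span> carrying that class; no manual parent check is needed.
pairs = [
    (a.get_text(strip=True), a["href"])
    for a in soup.select("span.meta-attributes__attr-tags a")
]

for text, href in pairs:
    print(text, href)
```

Note that the selector matches descendants at any depth, whereas the parent check above only accepts direct children; use "span.meta-attributes__attr-tags > a" if that distinction matters.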
Dušan Maďar answered Dec 21 '22 10:12
