
BeautifulSoup .link.get("href") only returns None

I'm playing around with BeautifulSoup while working on my web scraper. My links variable returns the blocks of markup I specify, but as soon as I try to grab the "href" it only prints "None".

from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.kickstarter.com/discover/advanced?sort=most_funded")

pageGrab = BeautifulSoup(r.content, "html.parser")

#This comment below is another way I tried
#for link in pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"}):

links = pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"})
for link in links:
    print (link.get("href"))

If I run this script on, say, reddit, some links are grabbed, but the vast majority come back as "None".

This is my first target on the page for extracting the "href":

<a target="" href="/projects/getpebble/pebble-time-awesome-smartwatch-no-compromises?ref=most_funded">Pebble Time - Awesome Smartwatch, No Compromises</a>
En_g_neer asked Oct 20 '25 05:10



2 Answers

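You are selecting the div elements, which don't have href attributes; the href is on the a element nested inside each div. Calling .get("href") on a tag that lacks that attribute returns None.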
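To see why the result is None, here is a minimal sketch on an inline snippet. The HTML is a made-up stand-in for one project card (an assumption, not the real Kickstarter markup); only the class names are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one project card; not the real Kickstarter markup.
html = """
<div class="project-profile-title text-truncate-xs">
  <a target="" href="/projects/example/some-project?ref=most_funded">Some Project</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "project-profile-title text-truncate-xs"})

# Tag.get() only looks at the tag's own attributes and returns None
# when the requested attribute is missing.
div_href = div.get("href")
print(div.attrs)  # the div carries a class attribute, but no href
print(div_href)   # None

# The href lives on the nested <a> element.
anchor_href = div.find("a").get("href")
print(anchor_href)
```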

You could simplify your code by using the .select() method to target the a elements inside those divs directly:

links = pageGrab.select('.project-profile-title.text-truncate-xs a')
for link in links:
    print (link.get('href'))

Of course, you could also keep your existing code and chain the .find() method onto each div. However, that assumes every div contains an a element: if one doesn't, .find('a') returns None and the .get() call raises an AttributeError. The .select() version above is therefore safer.

divs = pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"})
for div in divs:
    print (div.find('a').get("href"))
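If you prefer the find_all() approach, a guard against divs without an anchor could look like the sketch below. The inline HTML is hypothetical test data (the second div deliberately has no a element), and hrefs is just an illustrative variable name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div contains no <a> element.
html = """
<div class="project-profile-title text-truncate-xs"><a href="/projects/a">A</a></div>
<div class="project-profile-title text-truncate-xs">No link here</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", {"class": "project-profile-title text-truncate-xs"})

hrefs = []
for div in divs:
    anchor = div.find("a")  # None when the div contains no <a>
    if anchor is None:
        continue            # calling .get() on None would raise AttributeError
    hrefs.append(anchor.get("href"))

print(hrefs)
```

Skipping silently is one choice; depending on the scraper you might prefer to log the div instead, so that markup changes on the site do not go unnoticed.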

Additionally, if you want to take it a step further, the .select() method supports most CSS selectors, so you could add the [href] attribute selector to match only descendant anchor elements that have an href attribute:

links = pageGrab.select('.project-profile-title.text-truncate-xs a[href]')
for link in links:
    print (link.get('href'))
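The difference the [href] selector makes only shows up when some anchors lack the attribute. A small sketch on made-up markup (the second a element and the base variable are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor with an href, one without.
html = """
<div class="project-profile-title text-truncate-xs">
  <a href="/projects/a">A</a>
  <a name="bookmark">anchor without href</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
base = ".project-profile-title.text-truncate-xs"

# Every descendant <a>, whether or not it has an href.
all_anchors = [a.get("href") for a in soup.select(base + " a")]

# Only descendant <a> elements that carry an href attribute.
with_href = [a.get("href") for a in soup.select(base + " a[href]")]

print(all_anchors)
print(with_href)
```

With the plain selector the list still contains a None for the second anchor; with a[href] that element is never selected in the first place.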
Josh Crozier answered Oct 21 '25 19:10



links = pageGrab.find_all("div", {"class" : "project-profile-title text-truncate-xs"})
for link in links:
    print (link.a.get("href"))  # the div does not have an href; link.a is the first <a> tag inside it

Output:

/projects/getpebble/pebble-time-awesome-smartwatch-no-compromises?ref=most_funded
/projects/ryangrepper/coolest-cooler-21st-century-cooler-thats-actually?ref=most_funded
/projects/getpebble/pebble-2-time-2-and-core-an-entirely-new-3g-ultra?ref=most_funded
/projects/poots/kingdom-death-monster-15?ref=most_funded
/projects/getpebble/pebble-e-paper-watch-for-iphone-and-android?ref=most_funded
/projects/597538543/the-worlds-best-travel-jacket-with-15-features-bau?ref=most_funded
/projects/elanlee/exploding-kittens?ref=most_funded
/projects/ouya/ouya-a-new-kind-of-video-game-console?ref=most_funded
/projects/peak-design/the-everyday-backpack-tote-and-sling?ref=most_funded
/projects/antsylabs/fidget-cube-a-vinyl-desk-toy?ref=most_funded
宏杰李 answered Oct 21 '25 20:10



