
BeautifulSoup get_text returns NoneType object

I'm trying out BeautifulSoup for web scraping, and I need to extract headlines from the CNBC finance page, specifically from the 'more' headlines section. This is the code I've tried so far.

import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://www.cnbc.com/finance/?page=1')

soup = BeautifulSoup(response.text,'html.parser')

posts = soup.find_all(id='pipeline')

for post in posts:
    data = post.find_all('li')
    for entry in data:
        title = entry.find(class_='headline')
        print(title)

Running this code gives me ALL the headlines in the page in the following output format:

<div class="headline">
<a class=" " data-nodeid="105372063" href="/2018/08/02/after-apple-rallies-to-1-trillion-even-the-uber-bullish-crowd-on-wal.html">
           {{{*HEADLINE TEXT HERE*}}}
</a> </div>

However, if I use the get_text() method while fetching title in the above code, I only get the first two headlines.

title = entry.find(class_='headline').get_text()

Followed by this error:

Traceback (most recent call last):
  File "C:\Users\Tanay Roman\Documents\python projects\scrapper.py", line 16, in <module>
    title = entry.find(class_='headline').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

Why does adding the get_text() method return only partial results, and how do I solve it?

Tanay Roman asked Mar 05 '23 16:03



1 Answer

You are misunderstanding the error message. It is not that the .get_text() call returns a NoneType object; it is that objects of type NoneType do not have that method.

There is only ever exactly one object of type NoneType, the value None. Here it was returned by entry.find(class_='headline') because it could not find an element in entry matching the search criteria. In other words, for that entry element there is no descendant element with the class headline.
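
To make the failure concrete, here is a minimal sketch that reproduces it offline. The HTML string is hypothetical markup invented for illustration (it is not the real CNBC page), and the id ad-slot is an assumed name; the only point is that the second <li> contains nothing with the class headline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page): the second <li>
# has no element with class "headline".
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html">First headline</a></div></li>
  <li id="ad-slot"><div class="sponsored">Sponsored content</div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
entries = soup.find_all("li")

found = entries[0].find(class_="headline")    # a Tag object
missing = entries[1].find(class_="headline")  # no match, so find() gives None

print(type(found).__name__)  # Tag
print(missing is None)       # True

# Calling a Tag method on None raises the AttributeError from the question.
try:
    missing.get_text()
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The first two prints show that find() hands back a Tag when it matches and None when it does not; the exception only appears once a method is called on that None.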

There are two such <li> elements, one with the id nativedvriver3 and the other with nativedvriver9, and you'd get that error for both. You need to first check if there is a matching element:

for entry in data:
    headline = entry.find(class_='headline')
    # find() returns None when this <li> has no element with class "headline"
    if headline is not None:
        title = headline.get_text()
        print(title)
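
As a self-contained sketch of the same check, the loop below runs against a small hypothetical HTML string (again an assumption, not the real page) and records which entries were skipped. The names titles and skipped, and the id ad-slot, are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): the middle <li> has no headline.
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html"> First headline </a></div></li>
  <li id="ad-slot"><div class="sponsored">Sponsored content</div></li>
  <li><div class="headline"><a href="/b.html"> Second headline </a></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []   # text of every headline that was found
skipped = []  # id attribute of every <li> without a headline

for entry in soup.find(id="pipeline").find_all("li"):
    headline = entry.find(class_="headline")
    if headline is None:
        # Nothing to extract here; remember the entry and move on.
        skipped.append(entry.get("id"))
        continue
    titles.append(headline.get_text(strip=True))

print(titles)
print(skipped)
```

Recording the skipped ids is optional, but it makes it easy to see which list items lack a headline instead of silently dropping them.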

You'd have a much easier time if you used a CSS selector:

headlines = soup.select('#pipeline li .headline')
for headline in headlines:
    headline_text = headline.get_text(strip=True)
    print(headline_text)

This produces:

>>> headlines = soup.select('#pipeline li .headline')
>>> for headline in headlines:
...     headline_text = headline.get_text(strip=True)
...     print(headline_text)
...
Hedge funds fight back against tech in the war for talent
Goldman Sachs sees more price pain ahead for bitcoin
Dish Network shares rise 15% after subscriber losses are less than expected
Bitcoin whale makes ‘enormous’ losing bet, so now other traders have to foot the bill
The 'Netflix of fitness' looks to become a publicly traded stock as soon as next year
Amazon slammed for ‘insult’ tax bill in the UK despite record profits
Nasdaq could plunge 15 percent or more as ‘rolling bear market’ grips stocks: Morgan Stanley
Take-Two shares surge 9% after gamemaker beats expectations due to 'Grand Theft Auto Online'
UK bank RBS announces first dividend in 10 years
Michael Cohen reportedly secured a $10 million deal with Trump donor to advance a nuclear project
After-hours buzz: GPRO, AIG & more
Bitcoin is still too 'unstable' to become mainstream money, UBS says
Apple just hit a trillion but its stock performance has been dwarfed by the other tech giants
The first company to ever reach $1 trillion in market value was in China and got crushed
Apple at a trillion-dollar valuation isn’t crazy like the dot-com bubble
After Apple rallies to $1 trillion, even the uber bullish crowd on Wall Street believes it may need to cool off
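
The selector approach avoids the None check because select() returns only the elements that match, so list items without a headline never enter the loop. A minimal sketch against hypothetical markup (an assumption, not the real page; the class no-such-class is made up to show the empty case):

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): the middle <li> has no headline.
html = """
<ul id="pipeline">
  <li><div class="headline"><a href="/a.html"> First headline </a></div></li>
  <li id="ad-slot"><div class="sponsored">Sponsored content</div></li>
  <li><div class="headline"><a href="/b.html"> Second headline </a></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() yields only matching elements, so there is no None to guard against.
headline_texts = [
    tag.get_text(strip=True)
    for tag in soup.select("#pipeline li .headline")
]
print(headline_texts)

# A selector that matches nothing gives an empty result, not None.
no_matches = soup.select("#pipeline li .no-such-class")
print(len(no_matches))
```

Note that select_one(), like find(), returns None when nothing matches, so the same guard would still be needed there.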
Martijn Pieters answered Mar 20 '23 12:03