
What exactly is a NavigableString (HTML)?

I'm currently trying to scrape text from webpages using BeautifulSoup (bs4) on Python 2.7. My original code is as follows:

string = ''
a = soup.find('div',attrs={"id":"pressrelease"})
[x.extract() for x in a.findAll('script')]
[x.extract() for x in a.findAll("span", {'class':'hidden'})]
    
for element in a:
    try:
        string += element.get_text()
    except Exception as e:
        print(e)

Although my code does get me the desired text, it also prints the following error: 'NavigableString' object has no attribute 'get_text'. I want to add a feature that saves the URL whenever the code hits an exception. In this case, though, I don't want the URL saved, because the page was scraped successfully despite the exception. So I'm trying to understand exactly what this error means, so that I can decide whether to deliberately ignore it.

Any explanation of what a NavigableString is and why it causes my code to throw this error would be much appreciated!


1 Answer

A NavigableString is a bit of text in your HTML document. See the docs. At least one of the items inside the tag you searched for is a bit of text, probably some white space.

Strings can't contain text; they are text. So they don't have a get_text method, and it is an error to try to call one.
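To see this for yourself, here is a minimal sketch that prints each direct child of a tag along with its type. The HTML snippet is made up for illustration (only the id matches your code), and it assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup: the newlines between the tags are text nodes too.
html = '<div id="pressrelease">\n<p>First paragraph.</p>\n<p>Second.</p>\n</div>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("div", attrs={"id": "pressrelease"})

# Iterating over a tag yields its direct children: tags and strings alike.
children = list(a)
for child in children:
    print(type(child).__name__, repr(child))
```

The whitespace-only children show up as NavigableString objects sitting between the p tags, which is most likely what your loop is tripping over.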

You can check whether each item is a bs4.element.Tag before trying to call get_text() on it.

import bs4

for item in a:
    if isinstance(item, bs4.element.Tag):
        string += item.get_text()

Note I changed your iteration variable to item since the fact that you called it element has probably fixated you on the idea that it is, in fact, an HTML element, when in at least one case it's definitely not.
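If the goal is simply all the text inside the div, another option is to skip the loop and call get_text() on the container itself, since a Tag's get_text() joins the strings of all its descendants. This is a sketch under assumptions: the markup below is invented, and the variable name text is mine rather than from your code.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment mimicking the structure in the question.
html = (
    '<div id="pressrelease">Intro text '
    '<script>var x = 1;</script>'
    '<span class="hidden">skip me</span>'
    '<p>Body text.</p></div>'
)
soup = BeautifulSoup(html, "html.parser")
a = soup.find("div", attrs={"id": "pressrelease"})

# Remove the unwanted nodes with plain loops instead of
# list comprehensions used only for their side effects.
for node in a.find_all("script"):
    node.extract()
for node in a.find_all("span", {"class": "hidden"}):
    node.extract()

# One call on the container collects every remaining string inside it.
text = a.get_text()
print(text)
```

Be aware this is not identical to the loop above: text that sits directly inside the div ("Intro text " here) is included, whereas the Tag-only loop skips it. Which behavior you want depends on the pages you are scraping.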

answered Oct 26 '25 10:10 by kindall


