Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BeautifulSoup tag is type bs4.element.NavigableString and bs4.element.Tag

I'm trying to scrape a table in a Wikipedia article and the type of each table element appears to be both <class 'bs4.element.Tag'> and <class 'bs4.element.NavigableString'>.

import requests
import bs4
import lxml


resp = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts')

soup = bs4.BeautifulSoup(resp.text, 'lxml')

munis = soup.find(id='mw-content-text')('table')[1]

for muni in munis:
    print type(muni)
    print '============'

produces the following ouput:

<class 'bs4.element.Tag'>
============
<class 'bs4.element.NavigableString'>
============
<class 'bs4.element.Tag'>
============
<class 'bs4.element.NavigableString'>
============
<class 'bs4.element.Tag'>
============
<class 'bs4.element.NavigableString'>
...

When I try to retrieve muni.contents I get the AttributeError: 'NavigableString' object has no attribute 'contents' error.

What am I doing wrong? How do I get the bs4.element.Tag object for each muni?

(Using Python 2.7).

like image 782
lsimmons Avatar asked Nov 30 '16 03:11

lsimmons


People also ask

What is bs4 in BeautifulSoup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

What is NavigableString?

A NavigableString object holds the text within an HTML or an XML tag. This is a Python Unicode string with methods for searching and navigation. Sometimes we may need to navigate to other tags or text within an HTML/XML document based on the current text.

What is tag in BeautifulSoup?

A tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document. >>> from bs4 import BeautifulSoup >>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>') >>> tag = soup. html >>> type(tag) <class 'bs4.element.Tag'>

How do I use bs4 in Python?

To use beautiful soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser, the default is lxml . You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .


2 Answers

#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''

import requests
import bs4
from bs4 import BeautifulSoup
# from urllib.request import urlopen

html = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = BeautifulSoup(html.text, 'lxml')

symbolslist = soup.find('table').tr.next_siblings
for sec in symbolslist:
    # print(type(sec))
    if type(sec) is not bs4.element.NavigableString:
        print(sec.get_text())

result screenshot

like image 124
黄哥Python培训 Avatar answered Sep 20 '22 19:09

黄哥Python培训


If you have spaces in your markup in between nodes BeautifulSoup will turn those into NavigableString. Just put a try catch and see whether the contents are getting fetched as you would want them to -

for muni in munis:
    #print type(muni)
    try:
        print muni.contents
    except AttributeError:
        pass
    print '============'
like image 30
Vivek Kalyanarangan Avatar answered Sep 20 '22 19:09

Vivek Kalyanarangan