I'm trying to scrape a table in a Wikipedia article and the type of each table element appears to be both <class 'bs4.element.Tag'>
and <class 'bs4.element.NavigableString'>
.
import requests
import bs4
import lxml
resp = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts')
soup = bs4.BeautifulSoup(resp.text, 'lxml')
munis = soup.find(id='mw-content-text')('table')[1]
for muni in munis:
print type(muni)
print '============'
produces the following ouput:
<class 'bs4.element.Tag'>
============
<class 'bs4.element.NavigableString'>
============
<class 'bs4.element.Tag'>
============
<class 'bs4.element.NavigableString'>
============
<class 'bs4.element.Tag'>
============
<class 'bs4.element.NavigableString'>
...
When I try to retrieve muni.contents
I get the AttributeError: 'NavigableString' object has no attribute 'contents'
error.
What am I doing wrong? How do I get the bs4.element.Tag
object for each muni
?
(Using Python 2.7).
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
A NavigableString object holds the text within an HTML or an XML tag. This is a Python Unicode string with methods for searching and navigation. Sometimes we may need to navigate to other tags or text within an HTML/XML document based on the current text.
A tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document. >>> from bs4 import BeautifulSoup >>> soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>') >>> tag = soup. html >>> type(tag) <class 'bs4.element.Tag'>
To use beautiful soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser, the default is lxml . You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .
#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''
import requests
import bs4
from bs4 import BeautifulSoup
# from urllib.request import urlopen
html = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = BeautifulSoup(html.text, 'lxml')
symbolslist = soup.find('table').tr.next_siblings
for sec in symbolslist:
# print(type(sec))
if type(sec) is not bs4.element.NavigableString:
print(sec.get_text())
If you have spaces in your markup in between nodes BeautifulSoup will turn those into NavigableString. Just put a try catch and see whether the contents are getting fetched as you would want them to -
for muni in munis:
#print type(muni)
try:
print muni.contents
except AttributeError:
pass
print '============'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With