
How to extract innerHTML from tag using BeautifulSoup in Python

I am trying to extract the innerHTML from a tag using the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

theurl = "http://na.op.gg/summoner/userName=Darshan"
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")
rank = soup.findAll('span', {"class": "tierRank"})

However, I am getting [<span class="tierRank">Master</span>] instead. What I want is just the value "Master".

Using soup.get_text instead of soup.findAll doesn't work.

I tried adding .text and .string to the end of the last line, but that did not work either.

Naveen Manoharan asked Apr 19 '18 01:04





1 Answer

soup.findAll('span',{"class":"tierRank"}) returns a list of elements that match <span class="tierRank">.
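Because the result is a list and not a single tag, adding .text or .string directly to it fails. A minimal sketch of that behaviour, assuming a small inline HTML string as a stand-in for the real page (the html variable is hypothetical) and using find_all, the current spelling of findAll:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, not the real op.gg markup.
html = '<div><span class="tierRank">Master</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, even when only one tag matches.
matches = soup.find_all("span", {"class": "tierRank"})
print(isinstance(matches, list), len(matches))

# The list itself has no .text attribute, so this raises AttributeError.
try:
    matches.text
except AttributeError:
    print("a list of tags has no .text; index into it first")

# Indexing gives a single Tag, which does have .text.
first_text = matches[0].text
print(first_text)
```

If the class might be missing from the page, checking that the list is non-empty before indexing avoids an IndexError.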

  1. You want the first element from that list.
  2. You want the innerHTML of that element, which the decode_contents() method returns.

All together:

rank = soup.findAll('span',{"class":"tierRank"})[0].decode_contents()

This will store "Master" in rank.
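For a span that holds only text, decode_contents() and get_text() give the same string. They differ once the tag has nested markup. The sketch below illustrates this with a made-up HTML snippet (the nested <b> tag is an assumption for illustration, not what the real page contains):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a nested tag, to show where the two methods diverge.
html = '<span class="tierRank">Master <b>I</b></span>'
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag directly (or None if nothing matches).
tag = soup.find("span", {"class": "tierRank"})

inner_html = tag.decode_contents()  # markup inside the tag, nested tags kept
text_only = tag.get_text()          # text only, nested tags dropped

print(inner_html)
print(text_only)
```

Which one to use depends on whether the nested markup is wanted: decode_contents() is the closer match to innerHTML, while get_text() is usually enough when only the visible text matters.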

Matt Morgan answered Sep 29 '22 14:09
