
python BeautifulSoup searching a tag

This is my first post here. I'm trying to find all <a> tags with a given class in this specific HTML page, but I can't get them out. This is the code:

from bs4 import BeautifulSoup
from urllib import urlopen

url = "http://www.jutarnji.hr"
html_doc = urlopen(url).read()
soup = BeautifulSoup(html_doc)
soup.prettify()
soup.find_all("a", {"class":"black"})

find_all returns [], but I can see that there are tags with class="black" in the HTML. Am I missing something?

Thanks, Vedran

onoxo asked Mar 30 '12 17:03



2 Answers

I had the same problem.

Try

soup.findAll("a",{"class":"black"})

instead of

soup.find_all("a",{"class":"black"})

soup.findAll() works well for me.

Froyo answered Sep 22 '22 04:09


The problem here is that the website's class attributes aren't separated from the end of the href attribute value by a space. BeautifulSoup doesn't seem to handle this very well. A reproducible test case is the following:

>>> BeautifulSoup.BeautifulSoup('<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/" class="black">').prettify()
'<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/" class="black">\n</a>'
>>> BeautifulSoup.BeautifulSoup('<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/"class="black">').prettify()
''
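
If the missing space really is the cause, one possible workaround is to repair the markup before handing it to the parser. The sketch below is only an illustration under that assumption: separate_attributes and ClassCollector are hypothetical names, the URL is a shortened placeholder, and the standard library's html.parser is used as a stand-in to check the result. The repaired string could be passed to BeautifulSoup in the same way. The regex is a heuristic and may also touch text that merely looks like a glued attribute.

```python
import re
from html.parser import HTMLParser

# Matches a quoted attribute value that is immediately followed by the
# first character of another attribute name, with no whitespace between.
_GLUED_ATTR = re.compile(r"""(=\s*(?:"[^"]*"|'[^']*'))(?=[A-Za-z_:])""")


def separate_attributes(html):
    """Insert the missing space after a quoted value (hypothetical helper)."""
    return _GLUED_ATTR.sub(r"\1 ", html)


class ClassCollector(HTMLParser):
    """Collect the href of every <a> tag carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and self.wanted_class in classes:
            self.hrefs.append(attr_map.get("href"))


# Placeholder markup with the same defect: no space before class=.
broken = '<a href="http://www.jutarnji.hr/example/"class="black">link</a>'
fixed = separate_attributes(broken)

collector = ClassCollector("black")
collector.feed(fixed)
collector.close()

print(fixed)
print(collector.hrefs)
```

Markup that already has the space is left unchanged, so the helper can be applied to the whole page before parsing.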
Puneet answered Sep 19 '22 04:09