
BeautifulSoup - easy way to obtain HTML-free contents

I'm using this code to find all interesting links in a page:

soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))

And it does its job pretty well. Unfortunately, inside that <a> tag there are a lot of nested tags, such as font and b, among others. I'd like to get just the text content, without any HTML tags.

Example of link:

<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>

Of course it's ugly (and the markup is not always the same!) and I'd like to get:

03-11-2009:  CCS Ingegneria Elettronica-Sportello studenti ed orientamento

The documentation says to use text=True in the findAll method, but then my regex is ignored. Why? How can I solve that?

Andrea Ambu asked Nov 17 '09 23:11




1 Answer

I've used this:

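def textOf(soup):
    # Join every text node found anywhere under the given tag
    return u''.join(soup.findAll(text=True))

So...

# Select the links with the regex first, then extract the text of each one
texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]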
Jonathan Feinberg answered Sep 21 '22 00:09
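
The answer works in two steps: first pick the <a> tags whose href matches the regex, then join every text node nested under each one. As a rough illustration of that idea, here is a minimal sketch that does the same thing with Python's standard-library html.parser instead of BeautifulSoup. It is not the answerer's code: the class name LinkTextCollector, the LINK_RE constant, and the replacement of non-breaking spaces with plain spaces are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the question, written as a raw string (assumed name).
LINK_RE = re.compile(r'^notizia\.php\?idn=\d+')


class LinkTextCollector(HTMLParser):
    """Collect the text inside every <a> whose href matches LINK_RE."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.texts = []     # finished link texts
        self._parts = None  # text pieces of the link being read, or None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        href = dict(attrs).get('href') or ''
        if tag == 'a' and LINK_RE.search(href):
            self._parts = []

    def handle_data(self, data):
        # Keep text only while we are inside a matching link.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._parts is not None:
            # &nbsp; arrives as U+00A0; turn it into a plain space.
            self.texts.append(''.join(self._parts).replace('\xa0', ' '))
            self._parts = None


sample = ('<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();">'
          '<FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>'
          'CCS Ingegneria Elettronica-Sportello studenti ed orientamento'
          '</B></FONT></A>')

collector = LinkTextCollector()
collector.feed(sample)
collector.close()
print(collector.texts)
```

Nested tags such as font and b never reach handle_data, so only the text survives, whatever the markup between the opening and closing <a> looks like.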