I'm using this code to find all interesting links in a page:
soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))
And it does its job pretty well. Unfortunately, inside that a tag there are a lot of nested tags, like font, b and others. I'd like to get just the text content, without any HTML tags.
Example of link:
<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009: <font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>
Of course it's ugly (and the markup is not always the same!) and I'd like to get:
03-11-2009: CCS Ingegneria Elettronica-Sportello studenti ed orientamento
The documentation says to use text=True in the findAll method, but then it ignores my regex. Why? How can I solve that?
To select a DOM element by its tag name (<p>, <a>, <span>, ...), you can simply write soup.<tag>. The caveat is that this selects only the first HTML element with that tag.
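As a minimal sketch of that caveat, assuming BeautifulSoup 4 (the bs4 package) with the standard-library "html.parser" backend; the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the page in the question.
html = """
<p>First paragraph</p>
<p>Second paragraph</p>
<a href="one.html">One</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching element.
first_p = soup.p
first_text = first_p.get_text()

# find_all() returns every matching element.
all_texts = [p.get_text() for p in soup.find_all("p")]

print(first_text)
print(all_texts)
```

So soup.p is handy for a quick lookup, but find_all() is probably what you want whenever the tag can occur more than once.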
BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as Tag, NavigableString, or Comment.
In version 3, it is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. lxml/libxml2 often parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection.
I've used this:
def textOf(soup):
    return u''.join(soup.findAll(text=True))
So...
texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]
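For what it's worth, here is a rough equivalent under newer versions. This is only a sketch and assumes BeautifulSoup 4 (the bs4 package), where find_all() and get_text() are the method names; the second link in the sample is made up to show that the regex still filters.

```python
import re
from bs4 import BeautifulSoup

# The link from the question, plus a hypothetical non-matching link.
html = (
    '<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();">'
    '<FONT CLASS="v12"><B>03-11-2009: <font color=green>'
    'CCS Ingegneria Elettronica-Sportello studenti ed orientamento'
    '</B></FONT></A>'
    '<a href="other.php">ignored</a>'
)

soup = BeautifulSoup(html, "html.parser")

# Raw string, with the dot escaped so it matches a literal ".".
pattern = re.compile(r"^notizia\.php\?idn=\d+")

# get_text() concatenates every string nested inside the tag;
# split/join collapses any stray whitespace between the pieces.
texts = [" ".join(a.get_text().split()) for a in soup.find_all("a", href=pattern)]

print(texts)
```

The idea is the same as textOf above: filter the a tags by href first, then collect the strings inside each match, rather than passing text=True to the same search.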