BeautifulSoup - easy way to obtain HTML-free contents

Tags:

python

html-parsing

beautifulsoup

html-content-extraction

I'm using this code to find all interesting links in a page:

soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))

It does its job pretty well. Unfortunately, each matching a tag contains a lot of nested tags (font, b, and others), and I'd like to get just the text content, without any HTML tags.

Example link:

<A HREF="notizia.php?idn=1134" OnMouseOver="verde();" OnMouseOut="blu();"><FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>CCS Ingegneria Elettronica-Sportello studenti ed orientamento</B></FONT></A>

Of course it's ugly (and the markup is not always the same!) and I'd like to get:

03-11-2009:  CCS Ingegneria Elettronica-Sportello studenti ed orientamento

The documentation says to use text=True in the findAll method, but then my regex is ignored. Why? How can I solve that?


asked Nov 17 '09 23:11

Andrea Ambu

1 Answer

I've used this:

def textOf(soup):
    # Join every text node found under the given tag, dropping all markup.
    return u''.join(soup.findAll(text=True))

So...

texts = [textOf(n) for n in soup.findAll('a', href=re.compile(r'^notizia\.php\?idn=\d+'))]
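
On the "why": as far as I can tell from the BeautifulSoup 3 documentation, passing text to findAll makes it search for strings instead of tags, and the tag name and attribute filters are then ignored. That is why the search has to be done in two steps: first select the links by href, then collect the text inside each one. On the newer bs4 package, calling get_text() on each matched tag should give the same result as textOf.

The same two-step idea can be sketched without BeautifulSoup, using only the standard library's html.parser. This is not the method above, just a dependency-free illustration of it; the names LinkTextParser and link_texts are made up for this example, and unlike textOf it also collapses runs of whitespace (including non-breaking spaces) into single spaces.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the question, as a raw string with the dot escaped.
LINK_RE = re.compile(r'^notizia\.php\?idn=\d+')


class LinkTextParser(HTMLParser):
    """Hypothetical helper: collects the tag-free text of each matching <a>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.texts = []      # finished link texts
        self._chunks = None  # text pieces of the current link, None outside one

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if LINK_RE.match(href):
                self._chunks = []

    def handle_data(self, data):
        # Nested tags (font, b, ...) are skipped; only their text arrives here.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._chunks is not None:
            text = ''.join(self._chunks)
            # str.split() also splits on non-breaking spaces.
            self.texts.append(' '.join(text.split()))
            self._chunks = None


def link_texts(html):
    """Return the plain text of every link whose href matches LINK_RE."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


if __name__ == '__main__':
    sample = (
        '<A HREF="notizia.php?idn=1134" OnMouseOver="verde();">'
        '<FONT CLASS="v12"><B>03-11-2009:&nbsp;&nbsp;<font color=green>'
        'CCS Ingegneria Elettronica-Sportello studenti ed orientamento'
        '</B></FONT></A>'
    )
    print(link_texts(sample))
```

Links whose href does not match the pattern are skipped entirely, so their text never ends up in the result.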


answered Sep 21 '22 00:09

Jonathan Feinberg
