How can I strip comment tags from HTML using BeautifulSoup?

Tags:

python

beautifulsoup

I have been playing with BeautifulSoup, which is great. My end goal is to get just the text from a page: the text of the body, with a special case to also capture the title and/or alt attributes of <a> and <img> tags.

So far I have this (edited and updated) code:

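from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(page)  # page: the HTML to parse
# Find every comment node and remove it from the tree
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
# Join the remaining text nodes and collapse runs of whitespace
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page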

1) What is the best way to handle my special case so that those attributes from the two tags listed above are NOT excluded? If this is too complex, it is less important than #2.

2) I would like to strip comment tags (<!-- -->) and everything in between them. How would I go about that?

Question edit, @jathanism: here are some comment tags that I have tried to strip but that remain, even when I use your example:

<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->

asked Aug 17 '10 21:08

Nathan

1 Answer

Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
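
For readers on Python 3, here is a minimal sketch of the same idea using the bs4 package. That is an assumption on my part: the example above targets the older BeautifulSoup 3 import. The sketch also removes script and style elements, on the unconfirmed guess that the comments quoted in the question's edit sit inside <script> tags, where the parser hands them over as script text rather than as Comment nodes. Finally, it appends title/alt values from <a> and <img> tags so they show up in the extracted text. The helper names strip_comments and page_text are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(soup):
    """Remove every HTML comment node from the parsed tree, in place."""
    # find_all returns a list, so extracting while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup


def page_text(html):
    """Return the page text, plus title/alt values from <a> and <img> tags."""
    soup = BeautifulSoup(html, "html.parser")
    strip_comments(soup)

    # Script and style bodies are plain strings, not Comment nodes, so a
    # "<!-- ... //-->" wrapper inside <script> survives comment stripping.
    # Dropping the whole element removes it.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # Splice title/alt values into the tree so they appear in the text.
    for tag in soup.find_all(["a", "img"]):
        extras = [tag.get(name) for name in ("title", "alt") if tag.get(name)]
        if extras:
            tag.append(" " + " ".join(extras) + " ")

    # Join all remaining strings and collapse runs of whitespace.
    return " ".join(soup.get_text(" ").split())
```

With this sketch, page_text('1<!--x--><a title="Two">2</a>') should return '1 2 Two'. Appending the attribute values to the tree is just one way to do it; collecting them in a separate list would work equally well if the order of the text does not matter.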

answered Sep 30 '22 14:09

jathanism
