I am using Beautiful Soup to parse an HTML document and find all text that is not contained inside any anchor element.
I came up with the code below, which finds all the links that have an href attribute, but not the other way around.
How can I modify it to get only the plain text using Beautiful Soup, so that I can do a find-and-replace and modify the soup?
for a in soup.findAll('a', href=True):
    print a['href']
EDIT:
Example:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Output:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
I am doing this to find the text that is not within <a></a>, then find "Identify" and replace it with "Replaced".
So the final output will be like this:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Replaced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
Thanks for your time!
If I understand you correctly, you want to get the text inside an a element that has an href attribute. To get the text of an element, you can use its .text attribute.
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
Edit
This finds all the text nodes that contain "identified":
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
The returned objects are of type BeautifulSoup.NavigableString. If you want to check if the parent is an a element you can do txt.parent.name == 'a'.
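Note that the parent check only looks one level up, so text nested deeper inside a link (for example inside a b within an a) would not be caught. Below is a minimal sketch of a check against every ancestor. It assumes the newer bs4 package (Beautiful Soup 4) rather than the BeautifulSoup 3 module used in this answer, and the helper name text_outside_anchors is made up for illustration.

```python
from bs4 import BeautifulSoup


def text_outside_anchors(markup):
    """Return non-empty text nodes with no <a> ancestor (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        node.strip()
        for node in soup.find_all(string=True)
        # find_parent walks up the whole ancestor chain, not just one level.
        if node.strip() and node.find_parent("a") is None
    ]


sample = '<div><a href="x"><b>in link</b></a> outside <p>also <i>outside</i></p></div>'
print(text_outside_anchors(sample))
```

Whitespace-only nodes are dropped here for readability; remove the node.strip() condition if you need them.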
Another edit:
Here's another example with a regex and a replacement.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
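for txt in soup.findAll(text=True):
    # Only touch text nodes that match and whose direct parent is not an <a>.
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        # Note: txt.lower() lowercases the whole text node before substituting.
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)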
print(soup)
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br /></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
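The example above lowercases each matching text node as a side effect. If you want the literal "Identify" to "Replaced" substitution from the question with the original case kept, something like the following may work. It is only a sketch, again assuming the bs4 package (Beautiful Soup 4) instead of BeautifulSoup 3; the function name replace_outside_anchors and the shortened sample markup are illustrative, not part of either library.

```python
import re

from bs4 import BeautifulSoup


def replace_outside_anchors(markup, pattern, replacement):
    """Regex-replace in text nodes that have no <a> ancestor (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    regex = re.compile(pattern)
    for node in soup.find_all(string=True):
        # Skip text anywhere inside a link; attributes such as href are
        # not text nodes, so they are never touched.
        if node.find_parent("a") is None and regex.search(node):
            node.replace_with(regex.sub(replacement, node))
    return str(soup)


sample = '<div><a href="www.test1.com/identify">Identify</a> Identify me 1</div>'
result = replace_outside_anchors(sample, r"Identify", "Replaced")
print(result)
```

Because the pattern is case-sensitive, the lowercase "identified" occurrences in the question's sample would be left alone, which matches the expected output shown there.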