
How do I parse an HTML string with BeautifulSoup that has inner tags within text?

I have the following HTML content in a variable and need a way to read its text while removing the inner tags: html=<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>

I just want to extract the string India (ASIA) out of this with BeautifulSoup. Is that possible, or should I resort to regular expressions?

Kshitiz Gupta asked Nov 15 '25 11:11



1 Answer

One possible way with BeautifulSoup is to extract the text node that comes immediately before the first child <a> element:

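from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(html, "html.parser")
# The text node right before the first <a> holds "India (ASIA) (".
result = soup.find("a").previous_sibling
# In Python 3 the node is already a str, so no decode() call is needed.
print(result)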

Output:

India (ASIA) (

Tweaking the code further to remove the trailing ( from the result should be straightforward.
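
A minimal sketch of that tweak, assuming Python 3 and the built-in "html.parser" backend. The helper name leading_text is hypothetical, used only for illustration, and the fallbacks for cells without a link are assumptions rather than anything the question specifies:

```python
from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""


def leading_text(cell_html):
    """Return the text before the first <a>, without the dangling "(".

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(cell_html, "html.parser")
    first_link = soup.find("a")
    if first_link is None:
        # Assumption: with no inner link, the whole text is what we want.
        return soup.get_text().strip()
    before = first_link.previous_sibling
    # previous_sibling can be None or another tag; only a text node is useful.
    if not isinstance(before, str):
        return ""
    text = before.strip()
    # Drop the single "(" that opened the link group, plus the space before it.
    if text.endswith("("):
        text = text[:-1].rstrip()
    return text


print(leading_text(html))  # India (ASIA)
```

Checking endswith("(") before slicing removes only the one parenthesis that opened the link group, so the closing ")" of (ASIA) stays intact.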

har07 answered Nov 18 '25 21:11



