I have the following HTML content in a variable and need a way to read its text while leaving out the inner tags:
html=<td class="row">India (ASIA) (<a href="/asia/india">india</a> – <a href="/asia/india">photos</a>)</td>
I just want to extract the string India (ASIA) from this with BeautifulSoup. Is that possible, or should I resort to regular expressions?
This is one possible way using BeautifulSoup: extract the text node that comes immediately before the first child <a> element:
from bs4 import BeautifulSoup
html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a> – <a href="/asia/india">photos</a>)</td>"""
# Name the parser explicitly instead of letting bs4 pick one
soup = BeautifulSoup(html, "html.parser")
# The text node immediately before the first <a> tag
result = soup.find("a").previous_sibling
print(result)
Output:
India (ASIA) (
Tweaking the code further to remove the trailing ( from the result should be straightforward.
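As a minimal sketch of that last step, assuming the text before the first <a> always ends with a single opening parenthesis as in the sample above, the trailing ( and the space before it could be stripped like this (the variable names `text` and `result` are only illustrative):

```python
from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a> – <a href="/asia/india">photos</a>)</td>"""
soup = BeautifulSoup(html, "html.parser")

# Text node immediately before the first <a>, converted to a plain str
text = str(soup.find("a").previous_sibling)

# Assumption: the text ends with exactly one "(" that should be dropped
result = text.strip()
if result.endswith("("):
    # Remove only that one parenthesis, then the space left before it
    result = result[:-1].rstrip()

print(result)
```

Slicing off one character rather than calling rstrip("(") keeps any parenthesis that is part of the name itself; only the single ( that opened the link list is removed.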