From this HTML source:
<div class="category_link">
Category:
<a href="/category/personal">Personal</a>
</div>
I wish to extract the text "Category:".
Here are my attempts using Python/BeautifulSoup (the output of each call is shown as a comment after the #):
parsed = BeautifulSoup(sample_html)
parsed_div = parsed.findAll('div')[0]
parsed_div.firstText() # <a href="/category/personal">Personal</a>
parsed_div.first() # <a href="/category/personal">Personal</a>
parsed_div.findAll()[0] # <a href="/category/personal">Personal</a>
I'd expect a "text node" to be available as the first child. Any suggestions on how I can solve this?
I'm fairly sure the following should do what you want:
parsed.find('a').previousSibling # or something like that
That would return a NavigableString instance, which is pretty much the same thing as a unicode instance, but you may call unicode on that to get a unicode object.
I'll see if I can test this out and let you know.
EDIT: I just confirmed that it works:
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div class=a>Category: <a href="/">a link</a></div>')
>>> soup.find('a')
<a href="/">a link</a>
>>> soup.find('a').previousSibling
u'Category: '
>>>
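For anyone on the newer bs4 package with Python 3, here is a minimal sketch of the same idea applied to the HTML from the question. It assumes bs4 is installed and uses the built-in "html.parser" backend; the session above uses the older BeautifulSoup 3 import, so the attribute names differ slightly (previous_sibling instead of previousSibling). The variable names via_sibling and first_text are just for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

sample_html = """<div class="category_link">
Category:
<a href="/category/personal">Personal</a>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="category_link")

# Option 1: the text node directly before the <a> tag.
# The node keeps its surrounding newlines, so strip them off.
via_sibling = soup.find("a").previous_sibling.strip()

# Option 2: walk the div's direct children and take the first
# text node that is not just whitespace.
first_text = next(
    (child.strip() for child in div.children
     if isinstance(child, NavigableString) and child.strip()),
    None,  # fallback if the div holds no text node at all
)

print(via_sibling)
print(first_text)
```

Option 2 does not depend on an <a> tag being present, which may be handy if the markup varies. Both results are plain str objects after strip(), so no extra conversion should be needed.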