
Use BeautifulSoup to extract text before the first child tag

Given this HTML source:

<div class="category_link">
  Category:
  <a href="/category/personal">Personal</a>
</div>

I wish to extract the text "Category:".

Here are my attempts using Python/BeautifulSoup (the output is shown as a comment after the #):

parsed = BeautifulSoup(sample_html)
parsed_div = parsed.findAll('div')[0]
parsed_div.firstText() # <a href="/category/personal">Personal</a>
parsed_div.first() # <a href="/category/personal">Personal</a>
parsed_div.findAll()[0] # <a href="/category/personal">Personal</a>

I'd expect a "text node" to be available as the first child of the div. Any suggestions on how I can get at it?

Elvis D'Souza asked Apr 14 '12 14:04



1 Answer

I'm fairly sure the following should do what you want:

parsed.find('a').previousSibling # or something like that

That returns a NavigableString instance, which behaves much like a unicode instance; you can call unicode() on it if you need a plain unicode object.

I'll see if I can test this out and let you know.

EDIT: I just confirmed that it works:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div class=a>Category: <a href="/">a link</a></div>')
>>> soup.find('a')
<a href="/">a link</a>
>>> soup.find('a').previousSibling
u'Category: '
>>> 
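
For anyone on the newer bs4 package (an assumption; the session above uses the old BeautifulSoup 3 import), a minimal sketch of the same idea might look like this. In bs4 the attribute is spelled previous_sibling, and the variable names here are only illustrative:

```python
from bs4 import BeautifulSoup, NavigableString

# The HTML from the question.
sample_html = """
<div class="category_link">
  Category:
  <a href="/category/personal">Personal</a>
</div>
"""

# "html.parser" is the parser bundled with Python; other parsers also work.
soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="category_link")

# Option 1: the text node sitting just before the <a> tag.
before_link = div.find("a").previous_sibling

# Option 2: the div's first child, which here is the same text node.
first_child = div.contents[0]
is_text_node = isinstance(first_child, NavigableString)

# NavigableString subclasses str; strip() drops the surrounding whitespace.
label = str(before_link).strip()
print(label)  # Category:
```

Option 2 relies on the text coming before any tag inside the div, so checking is_text_node before using it is probably wise if the markup can vary.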
sharat87 answered Sep 18 '22 17:09
