Use BeautifulSoup to extract text before the first child tag

Question

Given this HTML source:

<div class="category_link">
  Category:
  <a href="/category/personal">Personal</a>
</div>

I wish to extract the text Category:

Here are my attempts using Python/BeautifulSoup (the output is shown as a comment after the #):

parsed = BeautifulSoup(sample_html)
parsed_div = parsed.findAll('div')[0]
parsed_div.firstText() # <a href="/category/personal">Personal</a>
parsed_div.first() # <a href="/category/personal">Personal</a>
parsed_div.findAll()[0] # <a href="/category/personal">Personal</a>

I'd expect a "text node" to be available as the first child. Any suggestions on how I can solve this?

sharat87 · Accepted Answer

I'm fairly sure the following should do what you want:

parsed.find('a').previousSibling # or something like that

That returns a NavigableString instance, which is a subclass of unicode and behaves much like one. If you need a plain unicode object, call unicode() on it.
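For anyone on Python 3, here is a rough sketch of the same idea. It assumes the newer beautifulsoup4 package (imported as bs4) rather than the BeautifulSoup 3 module used in this answer; there the text node subclasses str, and str() plays the role of unicode(). The variable names are only illustrative.

```python
# Assumption: beautifulsoup4 (bs4) on Python 3, not the BeautifulSoup 3 module.
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<div class="a">Category: <a href="/">a link</a></div>', "html.parser")

# bs4 names the attribute previous_sibling
node = soup.find("a").previous_sibling

is_navigable = isinstance(node, NavigableString)  # the text node is a NavigableString
is_text = isinstance(node, str)                   # which is itself a str subclass
plain = str(node)                                 # plain str, detached from the parse tree

print(repr(plain))
```

Converting to a plain string is mostly useful if you want to keep the text around without holding a reference to the whole parsed tree.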

I'll see if I can test this out and let you know.

EDIT: I just confirmed that it works:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div class=a>Category: <a href="/">a link</a></div>')
>>> soup.find('a')
<a href="/">a link</a>
>>> soup.find('a').previousSibling
u'Category: '
>>>
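Applied to the HTML from the question, the text node also carries the surrounding newlines and indentation, so it probably needs a strip(). The sketch below assumes beautifulsoup4 (bs4) on Python 3 with the built-in "html.parser" backend, which is not what the session above uses; sample_html is copied from the question.

```python
# Assumption: beautifulsoup4 (bs4) on Python 3, not the BeautifulSoup 3 module.
from bs4 import BeautifulSoup

sample_html = """<div class="category_link">
  Category:
  <a href="/category/personal">Personal</a>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="category_link")

# Text node immediately before the first <a> tag (previous_sibling in bs4)
before_link = div.find("a").previous_sibling

# The same text node is reachable as the first child of the div
first_child = div.contents[0]

# Drop the newlines and indentation around the label
label = before_link.strip()
print(label)
```

Going through contents[0] matches the "first child" expectation in the question, but it only works when the text really comes first; previous_sibling of the tag is a bit more explicit about what is being extracted.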

Tags: python, beautifulsoup

Elvis D'Souza
