
Extracting text outside of a <div> tag with BeautifulSoup

So I'm practicing my scraping and I came across something like this:

<div class="profileDetail">
    <div class="profileLabel">Mobile : </div>
     021 427 399 
</div>

and I need the number that sits outside of the inner <div> tag.

My code is:

num = soup.find("div",{"class":"profileLabel"}).text

but the output of that is only "Mobile : ", which is the text inside the <div> tag, not the text outside of it.

So how do we extract the text outside of the <div> tag?

Zion asked Jul 30 '15 18:07


2 Answers

I would make a reusable function to get the value by label, finding the label by text and getting the next sibling:

import re

def find_by_label(soup, label):
    # `string=` is the current name of the older `text=` argument
    return soup.find("div", string=re.compile(label)).next_sibling

Usage:

find_by_label(soup, "Mobile").strip()  # returns "021 427 399"
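
As a self-contained sketch of the approach above, run against the markup from the question. It assumes bs4 is installed and uses the built-in "html.parser"; the `html` and `number` variable names are only for illustration:

```python
import re

from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<div class="profileDetail">
    <div class="profileLabel">Mobile : </div>
     021 427 399 
</div>
"""


def find_by_label(soup, label):
    # Match the <div> whose single child is a string containing `label`,
    # then return the node that immediately follows that <div>.
    return soup.find("div", string=re.compile(label)).next_sibling


soup = BeautifulSoup(html, "html.parser")

# The sibling is a text node with surrounding whitespace, so strip it.
number = find_by_label(soup, "Mobile").strip()
print(number)
```

The outer profileDetail div is not matched because it has several children, so only the inner label div is found. If a label could contain regex metacharacters, passing it through re.escape() first would be a reasonable precaution.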
alecxe answered Sep 20 '22 02:09


Try using soup.find("div", {"class": "profileLabel"}).next_sibling. This grabs the next sibling node, which can be either a bs4.Tag or a bs4.NavigableString (or None if nothing follows).

A bs4.NavigableString is what you're trying to get in this case.

elem = soup.find("div",{"class":"profileLabel"}).next_sibling
print(type(elem))

# Should print:
# <class 'bs4.element.NavigableString'>

Example:

In [4]: s = bs4.BeautifulSoup('<div> Hello </div>HiThere<p>next_items</p>', 'html5lib')

In [5]: s
Out[5]: <html><head></head><body><div> Hello </div>HiThere<p>next_items</p></body></html>

In [6]: s.div
Out[6]: <div> Hello </div>

In [7]: s.div.next_sibling
Out[7]: 'HiThere'

In [8]: type(s.div.next_sibling)
Out[8]: bs4.element.NavigableString
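
Since next_sibling can be a string, a tag, or nothing at all, a small helper can cover all three cases. This is only a sketch: `text_after` is a hypothetical name, not part of bs4, and it uses the built-in "html.parser" rather than html5lib so that nothing extra needs to be installed:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def text_after(tag):
    """Return the stripped text of the node right after `tag`, or None.

    Hypothetical helper for illustration; not a bs4 function.
    """
    sibling = tag.next_sibling
    if isinstance(sibling, NavigableString):
        # Bare text between tags, as in the question.
        return sibling.strip()
    if isinstance(sibling, Tag):
        # The next node is an element, so take its text instead.
        return sibling.get_text(strip=True)
    # No following sibling at all.
    return None


soup = BeautifulSoup('<div> Hello </div>HiThere<p>next_items</p>', "html.parser")

after_div = text_after(soup.div)  # the sibling is a text node
after_p = text_after(soup.p)      # <p> is the last node, so no sibling

tags_only = BeautifulSoup('<b>x</b><i>y</i>', "html.parser")
after_b = text_after(tags_only.b)  # the sibling is an <i> tag

print(after_div, after_p, after_b)
```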
Brandon Nadeau answered Sep 21 '22 02:09