Get text before <br/> python/bs4

I'm trying to scrape some data from a web page. The text inside the tag contains newlines and <br/> tags. I only want the telephone number at the beginning of the tag. Can you advise me how to get only the number?

Here is the HTML code:

<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>

Is there a way in BeautifulSoup to get the text of a tag, but only the text that is not wrapped in other tags? And second: how do I get rid of both the text newlines and the HTML line breaks?

I use BS4.

The desired output is: '+421 48/471 78 14'

Do you have any ideas? Thank you.

Milano asked Aug 24 '14 21:08


2 Answers

html="""
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


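from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

# The first child of <td> is the text node before <br />
print(soup.find("td").contents[0].strip())
# +421 48/471 78 14

# .next_element of <td> is that same text node
print(soup.find("td").next_element.strip())
# +421 48/471 78 14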

soup.find("td").contents is the list of the tag's children. Its first element is the text node that comes before the <br />, and str.strip() removes the leading and trailing whitespace from it, including the \n newline chars.

And from the docs on .next_element:

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards
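
Both lines above rely on the number being the very first child of the <td>. If the loose text could sit anywhere among the children, a more general variant is to keep only the children that are plain strings. This is just a minimal sketch: direct_text is a hypothetical helper name, not something from bs4, and it assumes the html.parser backend.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


def direct_text(tag):
    """Return the text that sits directly inside `tag`, skipping text in child tags.

    Hypothetical helper for illustration; not part of bs4.
    """
    pieces = []
    for child in tag.children:
        # Direct text nodes are NavigableString; <br /> and <em> are Tag objects
        if isinstance(child, NavigableString):
            # split() + join collapses newlines and runs of spaces
            cleaned = " ".join(child.split())
            if cleaned:
                pieces.append(cleaned)
    return " ".join(pieces)


soup = BeautifulSoup(HTML, "html.parser")
phone = direct_text(soup.find("td"))
print(phone)
```

Since the text inside <em> belongs to a child tag, "(bowling)" is left out, and the whitespace-only strings around <br /> are dropped by the emptiness check.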

Padraic Cunningham answered Sep 30 '22 10:09


Does it work for you?

>>> from bs4 import BeautifulSoup
>>> html = "<td>   +421 48/471 78 14    <br /><em>(bowling)</em></td>"
>>> html = html.replace("\n", "")  # get rid of newlines
>>> soup = BeautifulSoup(html, "html.parser")
>>> for item in soup.td.children:
...     phone = item  # first item is the phone number
...     break
...
>>> phone
'   +421 48/471 78 14    '
>>> phone.strip()
'+421 48/471 78 14'
>>>
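
If you only need the first piece of text, the loop with break can probably be shortened with the stripped_strings generator that bs4 tags provide. A rough sketch, assuming the html.parser backend and the markup from the question:

```python
from bs4 import BeautifulSoup

html = "<td>\n    +421 48/471 78 14\n\n\n\n    <br />\n    <em>(bowling)</em>\n</td>"
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields every non-empty string under <td>, already stripped.
# next() takes the first one; the None default covers a cell with no text at all.
phone = next(soup.td.stripped_strings, None)
print(phone)
```

Note that stripped_strings also descends into child tags, so this only gives the phone number as long as it comes before any tag that contains text.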
Tamim Shahriar answered Sep 30 '22 11:09