I'm trying to scrape some data from a web page. The tag text contains newlines and <br/> tags. I want to get only the telephone number at the beginning of the tag. Could you advise me on how to get only the number?
Here is the HTML code:
<td>
    +421 48/471 78 14
    <br />
    <em>(bowling)</em>
</td>
Is there a way in BeautifulSoup to get only the text in a tag that is not surrounded by other tags? And second, how do I get rid of the text newlines and the HTML line breaks?
I use BS4.
The output should be: '+421 48/471 78 14'
Do you have any ideas? Thank you.
html="""
<td>
    +421 48/471 78 14
    <br />
    <em>(bowling)</em>
</td>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.find("td").contents[0].strip())
# +421 48/471 78 14
print(soup.find("td").next_element.strip())
# +421 48/471 78 14
soup.find("td").contents[0] takes the first child of the td tag, which is the text node before <br />, and str.strip() then removes the surrounding whitespace, including the \n newline characters.
From the docs on .next_element:
The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards
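As a rough Python 3 sketch of the same idea, the snippet below compares .contents[0] with the .stripped_strings generator, which yields each text node with its surrounding whitespace already removed. The variable names and the choice of "html.parser" are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<td>
    +421 48/471 78 14
    <br />
    <em>(bowling)</em>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# First child of <td> is the text node that precedes <br />.
first_child = td.contents[0].strip()

# stripped_strings skips whitespace-only nodes and strips the rest,
# so its first item is the phone number as well.
first_stripped = next(td.stripped_strings)

print(first_child)
print(first_stripped)
```

Both approaches rely on the number being the first piece of text inside the tag; if the markup put another element before it, .stripped_strings would return that element's text instead.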
Does it work for you?
>>> from bs4 import BeautifulSoup
>>> html = "<td>   +421 48/471 78 14    <br /><em>(bowling)</em></td>"
>>> html = html.replace("\n", "") # get rid of newlines
>>> soup = BeautifulSoup(html, "html.parser")
>>> for item in soup.td.children:
...   phone = item # first item is the phone number
...   break
... 
>>> phone
'   +421 48/471 78 14    '
>>> phone.strip()
'+421 48/471 78 14'
>>> 
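If the goal is specifically "text that is not surrounded by other tags", one possible sketch is to collect only the direct text children of the tag. The helper name direct_text is hypothetical, and the string=True argument assumes a reasonably recent bs4 release (older versions spell it text=True).

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of tag."""
    # recursive=False restricts the search to direct children,
    # so text inside <em> (or any other nested tag) is ignored.
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return [p for p in pieces if p]


soup = BeautifulSoup(
    "<td>   +421 48/471 78 14    <br /><em>(bowling)</em></td>",
    "html.parser",
)
print(direct_text(soup.td))
```

Unlike taking the first child, this does not depend on where the bare text sits inside the tag, at the cost of returning a list rather than a single string.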