I'm trying to scrape some data from a web page. The text inside one tag contains newlines and <br/> tags. I want to get only the telephone number at the beginning of the tag. Could you advise me how to get only the number?
Here is the HTML code:
<td>
+421 48/471 78 14
<br />
<em>(bowling)</em>
</td>
Is there a way in BeautifulSoup to get the text of a tag, but only the text that is not wrapped in other tags? And second: how do I get rid of the newlines in the text and the HTML line breaks?
I use BS4.
The expected output is: '+421 48/471 78 14'
Do you have any ideas? Thank you.
html="""
<td>
+421 48/471 78 14
<br />
<em>(bowling)</em>
</td>
"""
from bs4 import BeautifulSoup
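# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# First child of <td> is the text node before <br />
print(soup.find("td").contents[0].strip())
# +421 48/471 78 14
# next_element of <td> is that same text node
print(soup.find("td").next_element.strip())
# +421 48/471 78 14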
soup.find("td").contents[0].strip() takes the contents of the td tag (a list of its direct children), picks the first element, which is the text node before the <br />, and removes the surrounding \n newline characters with str.strip().
And from the docs on next_element:
The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards
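If the td may contain several bare text fragments, a more general variant of the same idea is to collect only the strings that are direct children of the tag. This is just a sketch, assuming the markup from the question and the built-in "html.parser"; the name direct_text is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<td>
+421 48/471 78 14
<br />
<em>(bowling)</em>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# recursive=False restricts the search to direct children of <td>,
# so the text inside <em> is not included. Whitespace-only strings
# (the newlines between tags) are filtered out after stripping.
direct_text = [
    s.strip()
    for s in td.find_all(string=True, recursive=False)
    if s.strip()
]
print(direct_text)
```

With the markup above this should leave a single entry, the phone number, and nothing from the em tag.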
Does it work for you?
>>> from bs4 import BeautifulSoup
>>> html = "<td> +421 48/471 78 14 <br /><em>(bowling)</em></td>"
>>> html = html.replace("\n", "")  # get rid of newlines, if there are any
>>> soup = BeautifulSoup(html, "html.parser")
>>> for item in soup.td.children:
...     phone = item  # first item is the phone number
...     break
...
>>> phone
' +421 48/471 78 14 '
>>> phone.strip()
'+421 48/471 78 14'
>>>
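Another option that might be worth trying is the stripped_strings generator, which yields every text fragment under a tag with the surrounding whitespace already removed. Note that it also descends into nested tags, so the sketch below only relies on the phone number being the first fragment; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<td>\n+421 48/471 78 14\n<br />\n<em>(bowling)</em>\n</td>"
soup = BeautifulSoup(html, "html.parser")

# stripped_strings walks all descendants of <td>, strips each string
# and skips the ones that are whitespace only.
fragments = list(soup.td.stripped_strings)

# Assumption: the phone number is always the first text in the cell.
phone = fragments[0]
print(phone)
```

Because the text from the em tag is included as a later fragment, this fits best when only the first piece of text is needed.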