Get text before <br/> python/bs4

Question

I'm trying to scrape some data from a web page. The tag text contains newlines and <br/> tags. I want to get only the telephone number at the beginning of the tag. Could you advise me on how to get only the number?

Here is the HTML code:

<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>

Is there a way in BeautifulSoup to get the text of a tag, but only the text that is not wrapped in other tags? And second: how do I get rid of the text newlines and the HTML line breaks?

I use BS4.

The expected output is: '+421 48/471 78 14'

Do you have any ideas? Thank you.

Padraic Cunningham · Accepted Answer

html="""
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


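from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# First child of <td> is the text node that precedes <br />
print(soup.find("td").contents[0].strip())
# +421 48/471 78 14

# The node parsed immediately after the <td> start tag is the same text node
print(soup.find("td").next_element.strip())
# +421 48/471 78 14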

soup.find("td").contents returns the list of the tag's direct children. We take the first element, which is the text node before the <br />, and remove the surrounding whitespace and newline characters with str.strip().

And from the docs on .next_element:

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards
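
The two one-liners above assume that the first child of the tag is a text node. As a minimal sketch of how that assumption could be guarded, the snippet below wraps the lookup in a helper. The name leading_text is hypothetical (it is not part of bs4), and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = """
<td>
    +421 48/471 78 14



    <br />
    <em>(bowling)</em>
</td>
"""


def leading_text(tag):
    """Return the stripped text that precedes the first child tag, or ''."""
    # .contents is the list of direct children; it is empty for an empty tag.
    first = tag.contents[0] if tag.contents else None
    # Only accept a text node; a Tag such as <br/> is not a string.
    if isinstance(first, NavigableString):
        return first.strip()
    return ""


soup = BeautifulSoup(html, "html.parser")
print(leading_text(soup.find("td")))
```

With the HTML from the question this prints the phone number. For a cell that starts with a tag, or has no children at all, it returns an empty string instead of raising an error.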

Tamim Shahriar · Answer

Does it work for you?

>>> from bs4 import BeautifulSoup
>>> html = "<td>   +421 48/471 78 14    <br /><em>(bowling)</em></td>"
>>> html = html.replace("\n", "")  # get rid of newlines
>>> soup = BeautifulSoup(html, "html.parser")
>>> for item in soup.td.children:
...     phone = item  # first item is the phone number
...     break
...
>>> phone
u'   +421 48/471 78 14    '
>>> phone.strip()
u'+421 48/471 78 14'
>>>
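
For the more general part of the question (only the text that is not wrapped in other tags), one possible sketch is to ask find_all for text nodes among the direct children only. This assumes a bs4 version that accepts the string= argument (older releases used text= for the same purpose) and "html.parser" as the parser. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<td>\n    +421 48/471 78 14\n\n\n\n    <br />\n    <em>(bowling)</em>\n</td>"

soup = BeautifulSoup(html, "html.parser")

# string=True matches text nodes; recursive=False limits the search to the
# direct children of <td>, so the text inside <em> is not included.
direct = soup.td.find_all(string=True, recursive=False)

# Drop the whitespace-only nodes that sit between the child tags.
own_text = [s.strip() for s in direct if s.strip()]
print(own_text)
```

Unlike taking the first child, this also collects text that appears after or between child tags, which may or may not be wanted depending on the page.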

Tags:

python

html

beautifulsoup

Milano
