BeautifulSoup: split the text in a tag by <br/>

Is it possible to split the text of a tag at its <br/> tags?

The tag has these contents: [u'+420 777 593 531', <br/>, u'+420 776 593 531', <br/>, u'+420 775 593 531']

I want to get only the numbers. Any advice?

EDIT: I tried the following, but it does not work at all:

[x for x in dt.find_next_sibling('dd').contents if x!=' <br/>']

Milano asked Jun 07 '15 14:06


1 Answer

You need to test for tags, which are modelled as Tag instances. Tag objects have a name attribute holding the tag name (for example 'br'), while text nodes, which are NavigableString instances, have no tag name, so getattr(x, 'name', None) gives None for them:

[x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
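An equivalent filter can test the node type directly instead of the name attribute. The following is a minimal sketch, assuming Python 3 and the built-in 'html.parser' backend; the sample markup and the variable names (html, dd, unfiltered, texts) are made up for illustration. It also shows why the comparison from the question removes nothing: a Tag never compares equal to a plain string such as ' <br/>'.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample markup, modelled on the contents shown in the question.
html = (
    "<dl><dt>Phone</dt>"
    "<dd>+420 777 593 531<br/>+420 776 593 531<br/>+420 775 593 531</dd></dl>"
)
soup = BeautifulSoup(html, "html.parser")
dd = soup.dt.find_next_sibling("dd")

# The attempt from the question: every node is "not equal" to the string
# ' <br/>', so the <br/> tags are kept along with the text.
unfiltered = [x for x in dd.contents if x != " <br/>"]

# Keep only the text nodes by checking the node type.
texts = [str(x) for x in dd.contents if isinstance(x, NavigableString)]

print(len(unfiltered))
print(texts)
```

Note that NavigableString also covers subclasses such as comments, so on messier markup the name-based test and the type-based test can keep slightly different nodes.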

Since you appear to only have text and <br /> elements in that <dd> element, you may as well just get all the contained strings instead:

list(dt.find_next_sibling('dd').stripped_strings)

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <dt>Term</dt>
... <dd>
...     +420 777 593 531<br/>
...     +420 776 593 531<br/>
...     +420 775 593 531<br/>
... </dd>
... ''', 'html.parser')
>>> dt = soup.dt
>>> [x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
[u'\n    +420 777 593 531', u'\n    +420 776 593 531', u'\n    +420 775 593 531', u'\n']
>>> list(dt.find_next_sibling('dd').stripped_strings)
[u'+420 777 593 531', u'+420 776 593 531', u'+420 775 593 531']
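For reference, the same stripped_strings approach can be written as a standalone Python 3 script. This is only a sketch: the helper name strings_after is hypothetical, and passing 'html.parser' explicitly is an assumption made to avoid the "no parser specified" warning in current Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Same markup as in the demo session.
HTML = """\
<dt>Term</dt>
<dd>
    +420 777 593 531<br/>
    +420 776 593 531<br/>
    +420 775 593 531<br/>
</dd>
"""


def strings_after(dt):
    """Return the whitespace-stripped text pieces of the <dd> following dt.

    Hypothetical helper; returns an empty list when no <dd> sibling exists.
    """
    dd = dt.find_next_sibling("dd")
    if dd is None:
        return []
    # stripped_strings skips whitespace-only strings and strips the rest.
    return list(dd.stripped_strings)


soup = BeautifulSoup(HTML, "html.parser")
numbers = strings_after(soup.dt)
print(numbers)
```

Keep in mind that stripped_strings walks all descendants, so if the <dd> contained nested tags (a <b> around part of a number, say), the text would be split at those tags as well, not only at <br/>.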
Martijn Pieters answered Sep 19 '22 22:09