Using BeautifulSoup to parse lines separated by <br> tags?

I have a page that looks like this:

Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />

Sometimes there are two rather than three "br" tags separating the entries. How would I use BeautifulSoup to parse through this document and extract the fields? I'm stumped because the bits of text that I need are not contained in paragraph (or similar) tags that I can simply iterate through.

Asked by jamieb, Feb 21 '10 at 07:02


1 Answer

You should look into the .strings attribute found on tags (or .stripped_strings, which also drops whitespace-only pieces), then use "\n".join() on the result.
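
As a rough illustration of that suggestion, here is a minimal sketch using the sample from the question. Joining .stripped_strings yields every text line but loses the boundaries between entries, so the sketch also groups lines by counting consecutive br tags. That grouping step, the helper name parse_entries, and the threshold of two br tags are assumptions made for this example, not part of the answer above.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup copied from the question.
HTML = """\
Company A<br />
123 Main St.<br />
Suite 101<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
Company B<br />
456 Main St.<br />
Someplace, NY 1234<br />
<br />
<br />
<br />
"""


def parse_entries(html):
    """Group text lines into entries (hypothetical helper).

    A run of two or more <br> tags with no text in between is treated
    as the boundary between entries; a single <br> is a line break.
    """
    soup = BeautifulSoup(html, "html.parser")
    entries, current, br_run = [], [], 0
    for node in soup.descendants:
        if isinstance(node, Tag) and node.name == "br":
            br_run += 1
        elif isinstance(node, NavigableString):
            text = node.strip()
            if not text:
                continue  # whitespace between tags carries no data
            if br_run >= 2 and current:
                entries.append(current)  # blank gap: close the entry
                current = []
            current.append(text)
            br_run = 0
    if current:
        entries.append(current)
    return entries


# The approach from the answer: all text pieces joined with newlines.
soup = BeautifulSoup(HTML, "html.parser")
all_lines = "\n".join(soup.stripped_strings)

# The grouped variant: one list of fields per company.
entries = parse_entries(HTML)
for entry in entries:
    print(entry)
```

Because the boundary test is "two or more" br tags, this should cope with entries separated by either two or three blank br lines, as described in the question. The built-in "html.parser" is used here only to avoid an extra dependency; other parsers may wrap the text in html/body tags, which soup.descendants would still walk through.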

Answered by ychaouche, Sep 23 '22 at 03:09