I'm trying to scrape some data for my app. Here is the HTML code:
<tr>
<td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
This
<a class="tip info" href="blablablablabla">is a second</a>
sentence.
<br>This
<a class="tip info" href="blablablablabla">is a third</a>
sentence.
<br>
</td>
</tr>
I want the output to look like this:
This is a first sentence.
This is a second sentence.
This is a third sentence.
Is it possible to do that?
It's certainly possible. I'll answer in slightly greater generality because I doubt that you want merely to process that chunk of HTML.
First, get a pointer to the td element,
td = soup.find('td')
Now, notice that you can get a list of this element's children,
>>> td_kids = list(td.children)
>>> td_kids
['\n This\n ', <a class="tip info" href="blablablablabla">is a first</a>, '\n sentence.\n ', <br/>, '\n This\n ', <a class="tip info" href="blablablablabla">is a second</a>, '\n sentence.\n ', <br/>, 'This\n ', <a class="tip info" href="blablablablabla">is a third</a>, '\n sentence.\n ', <br/>, '\n']
Some of the items in this list are strings, some are HTML elements. Crucially, some are br elements.
You could first split the list into one or more sublists at the br elements, by checking
isinstance(td_kids[<some k>], bs4.element.Tag) and td_kids[<some k>].name == 'br'
for each item in the list.
Then, you could go through each of the sublists, repeatedly replacing each remaining tag with the list of its own children. Eventually, you will have several sublists containing only what BeautifulSoup calls 'navigable strings', which you can manipulate as usual.
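The splitting and flattening steps described so far could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name split_on_br is hypothetical, the question's HTML is pasted into a string called html, and the built-in html.parser is used so that no extra parser needs to be installed. Instead of replacing nested tags by hand, it uses each tag's .strings iterator, which yields the text pieces inside the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The HTML from the question, pasted into a string (name is an assumption).
html = """<tr><td>
This
<a class="tip info" href="blablablablabla">is a first</a>
sentence.
<br>
This
<a class="tip info" href="blablablablabla">is a second</a>
sentence.
<br>This
<a class="tip info" href="blablablablabla">is a third</a>
sentence.
<br>
</td></tr>"""


def split_on_br(parent):
    """Hypothetical helper: group the text under `parent` into lists of
    plain strings, starting a new list at every <br> child."""
    groups, current = [], []
    for child in parent.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br> closes the current group.
            groups.append(current)
            current = []
        elif isinstance(child, Tag):
            # Any other tag is flattened into the strings it contains.
            current.extend(str(s) for s in child.strings)
        else:
            # Already a navigable string.
            current.append(str(child))
    groups.append(current)
    # Discard groups that hold nothing but whitespace.
    return [g for g in groups if any(s.strip() for s in g)]


soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")
groups = split_on_br(td)
```

Each entry of groups is then a list of raw strings for one sentence, still carrying the original newlines and indentation.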
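Join the elements of each sublist together, then I would suggest that you collapse the white space using a regex sub like this:
result = re.sub(r'\s+', ' ', <joined string>).strip()
Replace each run of white space with a single space rather than an empty string, otherwise 'This' and 'is a first' would be glued together.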
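As a small sketch of the join-and-clean-up step, assuming the sublists have already been reduced to plain strings: the groups value below is typed in by hand to mirror the children shown earlier, and normalize is a hypothetical helper name. It substitutes a single space for each run of white space, because an empty replacement would run neighbouring words together.

```python
import re

# Hand-written stand-in for the sublists of navigable strings (an assumption).
groups = [
    ["\n This\n ", "is a first", "\n sentence.\n "],
    ["\n This\n ", "is a second", "\n sentence.\n "],
    ["This\n ", "is a third", "\n sentence.\n "],
]


def normalize(pieces):
    """Join the pieces, collapse every whitespace run to one space,
    and trim the ends."""
    return re.sub(r"\s+", " ", "".join(pieces)).strip()


sentences = [normalize(g) for g in groups]
print("\n".join(sentences))
```

With this input, the printed lines are the three sentences the question asks for.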
Try this. It should give you the desired output. The content variable used in the script below is assumed to hold the HTML you pasted above.
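from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "lxml")
# For each link, glue together the text before it, its own text and the text after it,
# then collapse every run of whitespace into a single space.
sentences = [' '.join(''.join([item.previous_sibling, item.text, item.next_sibling]).split())
             for item in soup.select(".tip.info")]
# One sentence per line; joining on "\n" directly stays correct even if a sentence contains a comma.
data = '\n'.join(sentences)
print(data)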
Output:
This is a first sentence.
This is a second sentence.
This is a third sentence.