
How to get value between two different tags using beautiful soup?

I need to extract the data that sits between a closing </b> tag and the following <br>
tag in the code snippet below:

<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>

What I need is : W, 65, 3

The problem is that these values can be empty too, like this:

<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>

I want to get these values if they are present, and an empty string otherwise.

I tried nextSibling and find_next('br'), but they returned

 <br><b>Second Type :</b><br><b>Third Type :</b></br></br>

and

<br><b>Third Type :</b></br>

when the values (W, 65, 3) are not present between the </b> and <br> tags.

All I need is for it to return an empty string if nothing is present between those tags.

utkarsh awasthi asked Mar 02 '17 11:03



2 Answers

I would go through the <b> tags one by one and look at what each tag's next_sibling contains.

I would just check whether next_sibling.string is None, and append to the list accordingly :)

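>>> from bs4 import BeautifulSoup
>>> html = """<td><b>First Type :</b><br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"""

>>> soup = BeautifulSoup(html, "html.parser")
>>> b = soup.find_all("b")
>>> data = []
>>> for tag in b:
...     sibling = tag.next_sibling
...     # No sibling at all, or an empty <br> tag (whose .string is None), means the value is missing
...     if sibling is None or sibling.string is None:
...         data.append("")
...     else:
...         data.append(str(sibling.string))
...
>>> data
['', '65', '3']  # the first value was removed from the HTML above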
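
As a rough sketch of the same next_sibling idea, the loop can be wrapped in a function that also copes with a <b> tag that has nothing after it at all (next_sibling is None). The name values_after_bold is an assumption made for illustration and is not part of bs4. It tests for a text node with isinstance instead of reading .string, because .string on a tag that has a single text descendant is not None.

```python
from bs4 import BeautifulSoup, NavigableString


def values_after_bold(html):
    """Return the text that follows each <b> tag, or "" when there is none.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for tag in soup.find_all("b"):
        sibling = tag.next_sibling
        # Only a text node counts as a value; a <br> tag or None means "missing".
        if isinstance(sibling, NavigableString):
            values.append(str(sibling).strip())
        else:
            values.append("")
    return values


# The two snippets from the question.
filled = "<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>"
empty = "<td><b>First Type :</b><br><b>Second Type :</b><br><b>Third Type :</b></td>"

print(values_after_bold(filled))
print(values_after_bold(empty))
```

With the two snippets from the question this should print ['W', '65', '3'] and ['', '', ''].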

Hope this helps!

pedropedro answered Sep 19 '22 12:09



I would search for the td elements and then use a regex pattern to pull out the data that you need, instead of using re.compile in the find_all method.

Like this:

import re
from bs4 import BeautifulSoup

example = """<td><b>First Type :</b>W<br><b>Second Type :</b>65<br><b>Third Type :</b>3</td>
<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"""

soup = BeautifulSoup(example, "html.parser")

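for o in soup.find_all('td'):
    # Capture the text between each </b> and the next <br>, </br> or </td> tag
    match = re.findall(r'</b>\s*(.*?)\s*(<br|</br|</td)', str(o))
    print("%s,%s,%s" % (match[0][0], match[1][0], match[2][0]))

This pattern finds all text between a </b> tag and the next <br>, </br> or </td> tag. Depending on the parser and bs4 version, </br> tags may be added when the soup object is converted to a string; otherwise the last value is followed directly by </td>.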

This example outputs:

W,65,3
,69,6

This is just an example. A missing value already comes back as an empty string (see the second output line), so you can adapt the handling to whatever you need.
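
If it helps, here is a minimal sketch of one way to adapt the regex idea so that each label is paired with its value, with "" standing in for a missing value. The pattern and the name parse_types are assumptions made for illustration, not something from the question, and the function works on the td markup as a plain string (for example str(o) from the loop above).

```python
import re

# Hypothetical pattern: a <b>label :</b> followed by optional text up to the next tag.
PAIR = re.compile(r"<b>\s*(.*?)\s*:\s*</b>\s*([^<]*)")


def parse_types(td_html):
    """Map each label to its value, using "" when no value follows the label."""
    return {label: value.strip() for label, value in PAIR.findall(td_html)}


row = "<td><b>First Type :</b><br><b>Second Type :</b>69<br><b>Third Type :</b>6</td>"
print(parse_types(row))
```

For the row above this should print {'First Type': '', 'Second Type': '69', 'Third Type': '6'}. Because [^<]* stops at the next tag of any kind, the pattern does not depend on whether the <br> is rendered as <br>, <br/> or closed with </br>.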

Zroq answered Sep 18 '22 12:09
