
Beautiful Soup: how to avoid parsing nested table data twice

I have a nested table structure, and I am using the code below to parse the data.

for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        dataset = td.get_text()

The problem is that my tables are nested: some <td></td> cells contain tables of their own. Because find_all("tr") and find_all("td") search all descendants, the rows and cells of an inner table are matched again after their text has already been read as part of the outer cell. How can I avoid parsing the nested table a second time?
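To illustrate the cause, here is a minimal sketch. It assumes bs4 with the built-in html.parser and a trimmed version of my input; the variable names are only illustrative. find_all searches all descendants by default, while recursive=False restricts the search to direct children:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>1</td><td>2</td></tr>
<tr><td>5 <table><tr><td>11</td><td>22</td></tr></table> 6</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")

# Default (recursive=True): every descendant <td> matches, nested ones included.
all_cells = outer.find_all("td")

# recursive=False: only direct children match, so the inner table's
# rows and cells are not returned as separate results.
direct_cells = [
    td
    for tr in outer.find_all("tr", recursive=False)
    for td in tr.find_all("td", recursive=False)
]

print(len(all_cells), len(direct_cells))
```

The outer cell's get_text() still contains the inner table's text, which is why the nested values show up twice when the nested cells are also visited on their own.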

Input:

<table>
<tr>
   <td>1</td><td>2</td>
</tr>
<tr>
   <td>3</td><td>4</td>
</tr>
<tr>
  <td>5 
    <table><tr><td>11</td><td>22</td></tr></table>
      6
  </td>
</tr>
</table>

Expected Output:

1  2
3  4
5  
11 22
6

But what I am getting is:

1 2
3 4
5
11 22
11 22
6

That is, the inner table is parsed again.

Specs:
beautifulsoup4==4.6.3

Data order should be preserved, and the cell content can be anything, including arbitrary alphanumeric characters.

Nagaraju asked Oct 19 '25 16:10


1 Answer

Using a combination of bs4 and re, you can achieve what you want.

I am using bs4 4.6.3.

from bs4 import BeautifulSoup as bs
import re

html = '''
<table>
<tr>
   <td>1</td><td>2</td>
</tr>
<tr>
   <td>3</td><td>4</td>
</tr>
<tr>
  <td>5 
    <table><tr><td>11</td><td>22</td></tr></table>
      6
  </td>
</tr>
</table>'''

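soup = bs(html, 'lxml')

ans = []

# find_all('td') returns every cell in the document, nested ones included
for x in soup.find_all('td'):
    if x.find_all('td'):
        # Outer cell: cut out the inner table's markup, then collect the
        # numbers left in the surrounding pieces
        for y in re.split(r'<table>.*</table>', str(x)):
            ans += re.findall(r'\d+', y)
    else:
        # Leaf cell (no nested td): take its text as-is
        ans.append(x.text)
print(ans)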

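For each td, we test whether it contains a nested td. If it does, we split its markup on the inner table and use a regex to extract every number from the remaining pieces; the inner table's own cells are handled by later iterations, since the outer loop also visits them. Otherwise, we take the cell's text directly.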

Note that this works only for two levels of nesting, but it can be adapted to any depth.
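One possible way to adapt it to any depth is sketched below. This is a different technique from the regex split above: it walks each cell's children in order and recurses into nested tables, so document order is kept and the content does not have to be numeric. The helper names (cell_values, direct_rows, walk_cell) are made up for this example, and it assumes the built-in html.parser rather than lxml:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def direct_rows(table):
    """Yield rows that belong to this table only (directly or via thead/tbody/tfoot)."""
    for child in table.find_all(["tr", "thead", "tbody", "tfoot"], recursive=False):
        if child.name == "tr":
            yield child
        else:
            yield from child.find_all("tr", recursive=False)


def walk_cell(node):
    """Yield non-empty text pieces of a cell in order, descending into nested tables."""
    for child in node.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:
                yield text
        elif isinstance(child, Tag):
            if child.name == "table":
                yield from cell_values(child)
            else:
                yield from walk_cell(child)


def cell_values(table):
    """Yield every cell value of a table exactly once, in document order."""
    for tr in direct_rows(table):
        for td in tr.find_all(["td", "th"], recursive=False):
            yield from walk_cell(td)


html = """
<table>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
<tr>
  <td>5
    <table><tr><td>11</td><td>22</td></tr></table>
      6
  </td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
values = list(cell_values(soup.find("table")))
print(values)  # ['1', '2', '3', '4', '5', '11', '22', '6']
```

Because only direct children are searched at each level (recursive=False), a nested table is reached once, through the cell that contains it, and never again from the outer loop.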

BlueSheepToken answered Oct 22 '25 06:10