Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse table with rowspan and colspan

First, I have read Parsing a table with rowspan and colspan. I even answered the question. Please read before you mark this as duplicate.

<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="1">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

It will render like

+---+---+---+
| A | B |   |
+---+---+   |
|   | D |   |
+ C +---+---+
|   | E | F |
+---+---+---+
| G | H |   |
+---+---+---+
<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="2">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

However, this will render like this.

+---+---+-------+
| A | B |       |
+---+---+-------+
|   |   |       |
| C | D +---+---+
|   |   | E | F |
+---+---+---+---+
| G | H |       |
+---+---+---+---+

My code from previous answer can only parse table which has all the columns defined in the first row.

def table_to_2d(table_tag):
    rows = table_tag("tr")
    cols = rows[0](["td", "th"])
    table = [[None] * len(cols) for _ in range(len(rows))]
    for row_i, row in enumerate(rows):
        for col_i, col in enumerate(row(["td", "th"])):
            insert(table, row_i, col_i, col)
    return table


def insert(table, row, col, element):
    if row >= len(table) or col >= len(table[row]):
        return
    if table[row][col] is None:
        value = element.get_text()
        table[row][col] = value
        if element.has_attr("colspan"):
            span = int(element["colspan"])
            for i in range(1, span):
                table[row][col+i] = value
        if element.has_attr("rowspan"):
            span = int(element["rowspan"])
            for i in range(1, span):
                table[row+i][col] = value
    else:
        insert(table, row, col + 1, element)

soup = BeautifulSoup('''
    <table>
        <tr><th>1</th><th>2</th><th>5</th></tr>
        <tr><td rowspan="2">3</td><td colspan="2">4</td></tr>
        <tr><td>6</td><td>7</td></tr>
    </table>''', 'html.parser')
print(table_to_2d(soup.table))

My question is how to parse a table into a 2D array which represent exactly how it render in browser. Or someone can explain how the browser renders the table is also fine.

like image 224
Joshua Avatar asked Jan 23 '18 02:01

Joshua


2 Answers

You can't just count td or th cells, no. You'll have to do a scan across the table to get the number of columns on each row, adding to that count any active rowspans from a preceding row.

In a different scenario parsing a table with rowspans I tracked rowspan counts per column number to ensure that data from different cells ended up in the correct column. A similar technique can be used here.

First count columns; keep only the highest number. Keep a list of rowspan numbers of 2 or greater and subtract 1 from each for every row of columns you process. That way you know how many 'extra' columns there are on each row. Take the highest column count to build your output matrix.

Next, loop over the rows and cells again, and this time track rowspans in a dictionary mapping from column number to active count. Again, cary over anything with a value of 2 or over to the next row. Then shift column numbers to account for any rowspans that are active; the first td in a row would actually be the second if there was a rowspan active on column 0, etc.

Your code copies the value for spanned columns and rows into the output repeatedly; I achieved the same by creating a loop over the colspan and rowspan numbers of a given cell (each defaulting to 1) to copy the value multiple times. I’m ignoring overlapping cells; the HTML table specifications state that overlapping cells are an error and it is up to the user agent to resolve conflicts. In the code below colspan trumps rowspan cells.

from itertools import product

def table_to_2d(table_tag):
    rowspans = []  # track pending rowspans
    rows = table_tag.find_all('tr')

    # first scan, see how many columns we need
    colcount = 0
    for r, row in enumerate(rows):
        cells = row.find_all(['td', 'th'], recursive=False)
        # count columns (including spanned).
        # add active rowspans from preceding rows
        # we *ignore* the colspan value on the last cell, to prevent
        # creating 'phantom' columns with no actual cells, only extended
        # colspans. This is achieved by hardcoding the last cell width as 1. 
        # a colspan of 0 means “fill until the end” but can really only apply
        # to the last cell; ignore it elsewhere. 
        colcount = max(
            colcount,
            sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1]) + len(cells[-1:]) + len(rowspans))
        # update rowspan bookkeeping; 0 is a span to the bottom. 
        rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
        rowspans = [s - 1 for s in rowspans if s > 1]

    # it doesn't matter if there are still rowspan numbers 'active'; no extra
    # rows to show in the table means the larger than 1 rowspan numbers in the
    # last table row are ignored.

    # build an empty matrix for all possible cells
    table = [[None] * colcount for row in rows]

    # fill matrix from row data
    rowspans = {}  # track pending rowspans, column number mapping to count
    for row, row_elem in enumerate(rows):
        span_offset = 0  # how many columns are skipped due to row and colspans 
        for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
            # adjust for preceding row and colspans
            col += span_offset
            while rowspans.get(col, 0):
                span_offset += 1
                col += 1

            # fill table data
            rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
            colspan = int(cell.get('colspan', 1)) or colcount - col
            # next column is offset by the colspan
            span_offset += colspan - 1
            value = cell.get_text()
            for drow, dcol in product(range(rowspan), range(colspan)):
                try:
                    table[row + drow][col + dcol] = value
                    rowspans[col + dcol] = rowspan
                except IndexError:
                    # rowspan or colspan outside the confines of the table
                    pass

        # update rowspan bookkeeping
        rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}

    return table

This parses your sample table correctly:

>>> from pprint import pprint
>>> pprint(table_to_2d(soup.table), width=30)
[['1', '2', '5'],
 ['3', '4', '4'],
 ['3', '6', '7']]

and handles your other examples; first table:

>>> table1 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="1">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(table1.table), width=30)
[['A', 'B', None],
 ['C', 'D', None],
 ['C', 'E', 'F'],
 ['G', 'H', None]]

And the second:

>>> table2 = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <th>A</th>
...     <th>B</th>
...   </tr>
...   <tr>
...     <td rowspan="2">C</td>
...     <td rowspan="2">D</td>
...   </tr>
...   <tr>
...     <td>E</td>
...     <td>F</td>
...   </tr>
...   <tr>
...     <td>G</td>
...     <td>H</td>
...   </tr>
... </table>
... ''', 'html.parser')
>>> pprint(table_to_2d(table2.table), width=30)
[['A', 'B', None, None],
 ['C', 'D', None, None],
 ['C', 'D', 'E', 'F'],
 ['G', 'H', None, None]]

Last but not least, the code correctly handles spans that extend beyond the actual table, and "0" spans (extending to the ends), like in the following example:

<table border="1">
  <tr>
    <td rowspan="3">A</td>
    <td rowspan="0">B</td>
    <td>C</td>
    <td colspan="2">D</td>
  </tr>
  <tr>
    <td colspan="0">E</td>
  </tr>
</table>

There are two rows of 4 cells, even though the rowspan and colspan values would have you believe there could be 3 and 5:

+---+---+---+---+
|   |   | C | D |
| A | B +---+---+
|   |   |   E   |
+---+---+-------+

Such overspanning is handled just like the browser would; they are ignored, and the 0 spans extend to the remaining rows or columns:

>>> span_demo = BeautifulSoup('''
... <table border="1">
...   <tr>
...     <td rowspan="3">A</td>
...     <td rowspan="0">B</td>
...     <td>C</td>
...     <td colspan="2">D</td>
...   </tr>
...   <tr>
...     <td colspan="0">E</td>
...   </tr>
... </table>''', 'html.parser')
>>> pprint(table_to_2d(span_demo.table), width=30)
[['A', 'B', 'C', 'D'],
 ['A', 'B', 'E', 'E']]
like image 104
Martijn Pieters Avatar answered Nov 01 '22 16:11

Martijn Pieters


Important to note, that Martijn Pieters solution does not account for the case of cells having rowspan and colspan attribute simultaneosly. E.g.

<table border="1">
    <tr>
        <td rowspan="3" colspan="3">A</td>
        <td>B</td>
        <td>C</td>
        <td>D</td>
    </tr>
    <tr>
        <td colspan="3">E</td>
    </tr>
    <tr>
        <td colspan="1">E</td>
        <td>C</td>
        <td>C</td>
    </tr>
    <tr>
        <td colspan="1">E</td>
        <td>C</td>
        <td>C</td>
        <td>C</td>
        <td>C</td>
        <td>C</td>
    </tr>
</table>

This table renders to

+-----------+---+---+---+
| A         | B | C | D |
|           +---+---+---+
|           | E         |
|           +---+---+---+
|           | E | C | C |
+---+---+---+---+---+---+
| E | C | C | C | C | C |
+---+---+---+---+---+---+

but if we apply the function we get

[['A', 'A', 'A', 'B', 'C', 'D'],
 ['A', 'E', 'E', 'E', None, None],
 ['A', 'E', 'C', 'C', None, None],
 ['E', 'C', 'C', 'C', 'C', 'C']]

There may be some edge cases, but extending rowspan bookkeeping to cells in the product of rowspan and colspan, i.e.

   for drow, dcol in product(range(rowspan), range(colspan)):
            try:
                table[row + drow][col + dcol] = value
                rowspans[col + dcol] = rowspan
            except IndexError:
                # rowspan or colspan outside the confines of the table
                pass

seems to work on examples in this thread, and for the table above it will output

[['A', 'A', 'A', 'B', 'C', 'D'],
 ['A', 'A', 'A', 'E', 'E', 'E'],
 ['A', 'A', 'A', 'E', 'C', 'C'],
 ['E', 'C', 'C', 'C', 'C', 'C']]
like image 3
hellpanderr Avatar answered Nov 01 '22 17:11

hellpanderr