
Beautiful Soup: HTML table parsing

I am currently having a bit of an issue trying to parse a table into an array.

I have a simple table (HERE) which I need to parse with BS4, putting the cell contents into an array. What makes this difficult is that the cells don't contain text; instead they hold images whose titles are "Confirm" or "Site" (this is just user-rights stuff). [I am skipping row one, which contains the checkboxes; those I can extract without problems.]

If you look at the fiddle above, all I need to do is to parse it in such a way that the resulting array becomes:

Array1[0] = User1,Confirm,Confirm,Site,Confirm
Array1[1] = User2,Confirm,Confirm,Confirm,Confirm
Array1[2] = User3,Confirm,Confirm,Confirm,Confirm
Array1[3] = User4,Confirm,Site,Site,Confirm

I can then do with that as I please. Another complication is that the number of rows sometimes varies, so the script should adapt and build the array from however many rows the table has.

At the moment StackOverflow is my only hope. I have spent the last 10 hours on this with little to no success, and frankly I have lost hope. The closest I got was extracting the enclosed tags, but I could not parse any further for some weird reason; perhaps it's a nesting limitation in bs4? Could anyone have a look, please, and see if they can find a way of doing this, or at least explain how to get there?

Variable explanation: rightml is the soup for the table.

allusers = []
rows = rightml.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        if (td.find(title="Group")) or (td.find(title="User")):
            text = ''.join(td.text.strip())
            allusers.append(text)
print allusers

gifrights = []

rows7 = rightml.findAll('td')
#print rows7
for tr7 in rows:
    cols7 = tr7.findAll('img')
    for td7 in cols7:
        if (td7.find(title="Confirm")) or (td7.find(title="Site")):
            text = ''.join(td7.text.strip())
            text2 = text.split(' ')
            print text2
            gifrights.append(text2)

I could be WAY off with this code, but I gave it the ol' college try.

Meh asked Dec 26 '22 16:12


1 Answer

Would something like this work?

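# `soup` is the BeautifulSoup object for the page
rows = soup.find('tbody').findAll('tr')

for row in rows:
    cells = row.findAll('td')

    output = []

    for i, cell in enumerate(cells):
        if i == 0:
            # first cell holds the user or group name
            output.append(cell.text.strip())
        elif cell.find('img'):
            # rights cells: the value is the image's title attribute
            output.append(cell.find('img')['title'])
        elif cell.find('input'):
            # checkbox row: take the input's value attribute
            output.append(cell.find('input')['value'])
    print(output)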

This outputs the following (the u prefixes are Python 2 unicode string markers):

[u'Logged-in users', u'True', u'True', u'True', u'True']
[u'User 1', u'Confirm', u'Confirm', u'Site', u'Confirm']
[u'User 2', u'Confirm', u'Confirm', u'Confirm', u'Confirm']
[u'User 3', u'Confirm', u'Confirm', u'Confirm', u'Confirm']
[u'User 4', u'Confirm', u'Site', u'Site', u'Confirm']
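
For anyone who wants to try this without the original page, here is a minimal, self-contained sketch of the same idea for Python 3. The markup in SAMPLE_HTML is a made-up stand-in, since the real table from the question isn't reproduced here, and parse_rights and skip_rows are hypothetical names chosen for illustration. It also shows one likely reason the attempt in the question came back empty: an <img> tag has no children, so calling find(title=...) on it finds nothing, whereas the title is available as an attribute of the tag itself.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the table linked in the question.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td>Logged-in users</td>
      <td><input type="checkbox" value="True"></td>
      <td><input type="checkbox" value="True"></td>
    </tr>
    <tr>
      <td>User 1</td>
      <td><img src="c.gif" title="Confirm"></td>
      <td><img src="s.gif" title="Site"></td>
    </tr>
    <tr>
      <td>User 2</td>
      <td><img src="c.gif" title="Confirm"></td>
      <td><img src="c.gif" title="Confirm"></td>
    </tr>
  </tbody>
</table>
"""


def parse_rights(html, skip_rows=1):
    """Return one list per table row: the row label, then each image's title."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    # Slicing off the first rows skips the checkbox row; the loop itself
    # copes with any number of remaining rows.
    for row in soup.find_all("tr")[skip_rows:]:
        cells = row.find_all("td")
        if not cells:
            continue  # e.g. a header row made only of <th> cells
        values = [cells[0].get_text(strip=True)]
        for cell in cells[1:]:
            img = cell.find("img")
            # <img> has no children, so img.find(title=...) would return None;
            # read the title attribute from the tag instead.
            if img is not None and img.get("title"):
                values.append(img["title"])
        result.append(values)
    return result


rights = parse_rights(SAMPLE_HTML)
for values in rights:
    # Comma-separated form, as asked for in the question
    print(",".join(values))
```

Joining each inner list with "," gives the comma-separated shape from the question; keeping the lists themselves is usually handier if the values need further processing.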

Nathan Villaescusa answered Dec 31 '22 14:12