
python, lxml and xpath - html table parsing

Tags: python, xpath, lxml

I'm new to lxml, fairly new to Python, and could not find a solution to the following:

I need to import a few tables with 3 columns and an undefined number of rows starting at row 3.

When the second column of any row is empty, that row should be discarded and processing of the table should stop.

The following code prints the table's data fine (but I'm unable to reuse the data afterwards):

from lxml.html import parse

def process_row(row):  
    for cell in row.xpath('./td'):  
        print cell.text_content()  
        yield cell.text_content()  

def process_table(table):  
    return [process_row(row) for row in table.xpath('./tr')]

doc = parse(url).getroot()  
tbl = doc.xpath("/html//table[2]")[0]  
data = process_table(tbl)  

This only prints the first column :(

for i in data:  
    print i.next()

The following only imports the third row, and not the subsequent ones:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

Does anyone know a neat way to get all the data from row 3 onward into tbl and copy it into an array, so it can be processed by a module with no lxml dependency?

Thanks in advance for your help, Alex

user191131 asked Nov 05 '22 19:11



1 Answer

This is a generator:

def process_row(row):
    for cell in row.xpath('./td'):
        print cell.text_content()
        yield cell.text_content()

You're calling it as though you thought it returns a list. It doesn't. There are contexts in which it behaves like a list:

print [r for r in process_row(row)]

but that's only because a generator and a list both expose the same interface to for loops. Using it in a context where it gets evaluated just one time, e.g.:

return [process_row(row) for row in table.xpath('./tr')]

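just creates a new generator object for each value of row and stores it in the list without running it. Nothing is yielded until something iterates over it; your loop then calls next() once per generator, so you get only the first cell of each row.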
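To make the difference concrete, here is a minimal sketch that needs no lxml at all. The plain lists of strings are hypothetical stand-ins for rows and cells, `strip()` stands in for `text_content()`, and the names `cells_lazy` and `cells_eager` are made up for illustration. The parenthesised `print(...)` form is used so the snippet also runs on Python 3.

```python
def cells_lazy(row):
    # Generator version: the body does not run until something iterates.
    for cell in row:
        yield cell.strip()


def cells_eager(row):
    # List version: every cell is collected before the function returns.
    return [cell.strip() for cell in row]


rows = [[" a ", "b", "c"], [" d ", "e", "f"]]

lazy = [cells_lazy(r) for r in rows]    # a list of generator objects
eager = [cells_eager(r) for r in rows]  # a list of lists of strings

first_cells = [next(g) for g in lazy]   # pulls ONE value from each generator
rest = [list(g) for g in lazy]          # only what has not been consumed yet

print(first_cells)  # ['a', 'd']
print(rest)         # [['b', 'c'], ['e', 'f']]
print(eager)        # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The eager version is probably what you want here, since its result can be indexed, iterated more than once, and handed to other code as ordinary lists.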

So that's your first problem. Your second one is that you're expecting:

tbl = doc.xpath("//body/table[2]//tr[position()>2]")[0]

to give you the third and all subsequent rows, and it's only setting tbl to the third row. Well, the call to xpath is returning the third and all subsequent rows. It's the [0] at the end that's messing you up.
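Putting both points together, one possible sketch looks like the following. It assumes `doc` is the root element you already get from `parse(url).getroot()` and reuses the XPath expression from the question unchanged. The function names `rows_as_lists` and `take_until_blank` are hypothetical, and treating a whitespace-only cell as "empty" is an assumption about what you need.

```python
def rows_as_lists(doc):
    """Return the cell texts of the third and later rows as plain lists."""
    # Without a trailing [0], xpath() returns a list of ALL matching
    # <tr> elements, not just the first one.
    trs = doc.xpath("//body/table[2]//tr[position()>2]")
    return [[td.text_content().strip() for td in tr.xpath("./td")]
            for tr in trs]


def take_until_blank(rows, column=1):
    """Keep rows until one has an empty cell in `column` (0-based)."""
    kept = []
    for row in rows:
        if len(row) <= column or not row[column]:
            break  # discard this row and stop processing the table
        kept.append(row)
    return kept


# take_until_blank() only sees lists of strings, so it can live in a
# module that never imports lxml. With lxml the call would be:
#     data = take_until_blank(rows_as_lists(doc))
sample = [["1", "x", "a"], ["2", "y", "b"], ["3", "", "c"], ["4", "z", "d"]]
print(take_until_blank(sample))  # [['1', 'x', 'a'], ['2', 'y', 'b']]
```

Only `rows_as_lists` touches lxml objects; everything after it works on ordinary lists of strings, which is what keeps the downstream module free of the lxml dependency.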

Robert Rossney answered Nov 12 '22 17:11
