
Extracting text from a table with python and lxml

Tags: python, lxml

I recently saw that another user had asked about extracting information from a web table, in the question "Extracting information from a webpage with python". The answer from ekhumoro works well on the page that user asked about. See below.

from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/'

tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "section_")]'):
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td//text()')
        print '  ', cols[0].ljust(25), ' '.join(cols[1:])
    print

My problem is adapting this code to parse this page: http://www.uscho.com/rankings/d-i-mens-poll/. With the following changes, I can only get the h1 and h3 to print.

Input

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())

for section in tree.xpath('//section[starts-with(@id, "rankings")]'):
    print section.xpath('h1[1]/text()')[0]
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td/b/text()')
        print '  ', cols[0].ljust(25), ' '.join(cols[1:])
    print

Output

USCHO.com Division I Men's Poll
December 12, 2011

The structure of the table seems to be the same, so I'm at a loss as to why I can't use similar code. I'm just a mechanical engineer in way over my head. Any help is appreciated.

drivendaily asked Feb 19 '26 05:02


1 Answer

lxml is great, but if you're not familiar with XPath, I recommend BeautifulSoup:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
soup = BeautifulSoup(urlopen(url).read())

section = soup.find('section', id='rankings')
h1 = section.find('h1')
print h1.text
h3 = section.find('h3')
print h3.text
print

rows = section.find('table').findAll('tr')[1:-1]
for row in rows:
    columns = [data.text for data in row.findAll('td')[1:]]
    print '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)

The output for this script is:

USCHO.com Division I Men's Poll
December 12, 2011

Minnesota-Duluth     (49) 12-3-3  999
Minnesota                 14-5-1  901
Boston College            12-6-0  875
Ohio State           ( 1) 13-4-1  848
Merrimack                 10-2-2  844
Notre Dame                11-6-3  667
Colorado College           9-5-0  650
Western Michigan           9-4-5  647
Boston University         10-5-1  581
Ferris State              11-6-1  521
Union                      8-3-5  510
Colgate                   11-4-2  495
Cornell                    7-3-1  347
Denver                     7-6-3  329
Michigan State            10-6-2  306
Lake Superior             11-7-2  258
Massachusetts-Lowell      10-5-0  251
North Dakota               9-8-1   88
Yale                       6-5-1   69
Michigan                   9-8-3   62
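
For readers who would rather stay with lxml, here is a minimal sketch of one possible fix. It rests on an assumption, since the poll page's markup is not shown in this thread: the path table/tbody/tr only matches a table that is a direct child of the section and that has an explicit tbody in the source, and lxml's HTML parser does not insert a tbody the way browsers do. Searching for tr elements anywhere below the section avoids both requirements. The SAMPLE markup, its team names, and the helper names are made up for illustration. The code is written for Python 3, unlike the Python 2 snippets above, and it falls back to the standard library's ElementTree when lxml is not installed, which works here only because the sample is well-formed.

```python
try:
    from lxml import etree
    parse = etree.HTML            # lenient HTML parser, as used in the question
except ImportError:
    import xml.etree.ElementTree as etree
    parse = etree.fromstring      # stdlib fallback; needs well-formed markup

# Hypothetical markup, NOT copied from the real page: the table sits inside
# a <div> and has no <tbody>, and the header row uses <th> cells.
SAMPLE = """
<html><body>
<section id="rankings">
  <h1>Sample Poll</h1>
  <h3>January 1, 2000</h3>
  <div>
    <table>
      <tr><th>Rank</th><th>Team</th><th>Record</th><th>Points</th></tr>
      <tr><td>1</td><td><b>Team A</b></td><td>12-3-3</td><td>999</td></tr>
      <tr><td>2</td><td><b>Team B</b></td><td>14-5-1</td><td>901</td></tr>
    </table>
  </div>
</section>
</body></html>
"""


def strict_rows(markup):
    """Rows found by the question's path, which expects section > table > tbody > tr."""
    section = parse(markup).find(".//section[@id='rankings']")
    return section.findall("table/tbody/tr")


def extract_rows(markup):
    """Return the text of every <td> for each data row below the rankings section."""
    section = parse(markup).find(".//section[@id='rankings']")
    rows = []
    for tr in section.findall(".//tr"):          # any depth, tbody or not
        # itertext() gathers text inside nested tags such as <b>
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        if cells:                                # header row has only <th> cells
            rows.append(cells)
    return rows


if __name__ == "__main__":
    print("rows matched by table/tbody/tr:", len(strict_rows(SAMPLE)))
    for cells in extract_rows(SAMPLE):
        print("{0:4} {1:20} {2:>6} {3:>4}".format(*cells))
```

To try this against the live page, the markup would come from the URL in the question instead of SAMPLE, and the section id and column layout would need to be checked against the actual source.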
jcollado answered Feb 20 '26 17:02


