
Is there a clean way to get the n-th column of an html table using BeautifulSoup?

Say we look at the first table in a page, so:

table = BeautifulSoup(...).table

The rows can be scanned with a clean for-loop:

# iterating the tag directly would also yield the whitespace strings between rows
for row in table.find_all('tr'):
    f(row)

But for getting a single column things get messy.

My question: is there an elegant way to extract a single column, either by its position, or by its 'name' (i.e. text that appears in the first row of this column)?

Benjamin Nitlehoo asked Apr 03 '11 20:04





1 Answer

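lxml is usually much faster than BeautifulSoup with its default parser, so you might want to use that. Its cssselect() method (which needs the separate cssselect package in recent lxml versions) accepts CSS selectors, and td:nth-child(n) picks the n-th cell of a row, counting from 1.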

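from lxml.html import parse

doc = parse('http://python.org').getroot()
# 'table tr' also matches rows wrapped in <tbody>; 'table > tr' would miss them
for row in doc.cssselect('table tr'):
    # nth-child is 1-based, so this selects the third cell of each row
    for cell in row.cssselect('td:nth-child(3)'):
        print(cell.text_content())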

Or, as a single list comprehension:

# one selector that matches the third cell of every row
cells = [cell.text_content() for cell in doc.cssselect('table tr > td:nth-child(3)')]
print(cells)
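
If you would rather stay with BeautifulSoup, as the question asks, something along these lines might do. It is only a sketch: get_column is a hypothetical helper (not part of BeautifulSoup), the sample HTML is made up, and it assumes a simple table with no colspan/rowspan cells and no nested tables. An integer key is treated as a 0-based position; a string key is looked up in the first row, which is then skipped.

```python
from bs4 import BeautifulSoup


def get_column(table, key):
    """Return the cell texts of one column.

    key is either a 0-based position (int) or the text of the
    column's cell in the first row (str). Hypothetical helper.
    """
    rows = table.find_all("tr")
    if not rows:
        return []
    if isinstance(key, str):
        # Resolve the column name against the first row, then skip that row.
        header = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
        index = header.index(key)  # raises ValueError if the name is absent
        rows = rows[1:]
    else:
        index = key
    column = []
    for row in rows:
        cells = row.find_all(["th", "td"])
        if index < len(cells):  # ignore rows that are too short
            column.append(cells[index].get_text(strip=True))
    return column


# Made-up sample data, just to exercise the helper.
html = """
<table>
  <tr><th>Name</th><th>Lang</th><th>Year</th></tr>
  <tr><td>Ada</td><td>Python</td><td>2011</td></tr>
  <tr><td>Bob</td><td>Ruby</td><td>2012</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table
by_position = get_column(table, 2)   # third column, header row included
by_name = get_column(table, "Lang")  # column named "Lang", header row skipped
print(by_position)
print(by_name)
```

Selecting by position keeps the header cell in the result, while selecting by name drops it; adjust to taste.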
Christopher O'Donnell answered Oct 03 '22 07:10
