Say we look at the first table in a page, so:
table = BeautifulSoup(...).table
the rows can be scanned with a clean for-loop:
for row in table:
    f(row)
But getting a single column gets messy.
My question: is there an elegant way to extract a single column, either by its position, or by its 'name' (i.e. text that appears in the first row of this column)?
Several Python libraries can extract content from an HTML table. One option is the read_html() function in the popular pandas library. It returns a list of DataFrames, one per table found in the page, and accepts numerous arguments that let you customize how the tables are parsed.
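A minimal sketch of the read_html() approach, selecting one column by header text and by position. The HTML snippet and its column names are made up for illustration, and read_html() needs a parser backend installed (lxml, or BeautifulSoup with html5lib).

```python
import io

import pandas as pd

# Hypothetical sample page; the first row of <th> cells becomes the header.
HTML = """
<table>
  <tr><th>Name</th><th>Age</th><th>City</th></tr>
  <tr><td>Ann</td><td>31</td><td>Oslo</td></tr>
  <tr><td>Raj</td><td>45</td><td>Lima</td></tr>
</table>
"""

# read_html returns a list of DataFrames; take the first table on the page.
df = pd.read_html(io.StringIO(HTML))[0]

# Column by name (text in the header row).
by_name = df["City"].tolist()

# Column by position (0-based), here the third column.
by_position = df.iloc[:, 2].tolist()

print(by_name)
print(by_position)
```

Wrapping the markup in io.StringIO avoids passing a literal HTML string, which recent pandas versions warn about. For a real page you could pass the URL or the downloaded text instead.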
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string.
lxml is usually much faster than BeautifulSoup, so you might want to use that. Note that cssselect() requires the separate cssselect package to be installed.
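from lxml.html import parse

# Parse the page and get the root <html> element.
doc = parse('http://python.org').getroot()
# Print the text of the third cell in every row that is a direct child of a table.
for row in doc.cssselect('table > tr'):
    for cell in row.cssselect('td:nth-child(3)'):
        print(cell.text_content())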
Or, as a list comprehension instead of nested loops:
rows = doc.cssselect('table > tr')
cells = [cell.text_content() for row in rows for cell in row.cssselect('td:nth-child(3)')]
print(cells)
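If you would rather stay with BeautifulSoup, as in the question, the same idea can be sketched with two small helpers. This is only one possible approach: column_by_index and column_by_name are hypothetical names, the sample HTML is made up, and the code assumes a simple table with no colspan/rowspan and no nested tables.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only for illustration.
HTML = """
<table>
  <tr><th>Name</th><th>Age</th><th>City</th></tr>
  <tr><td>Ann</td><td>31</td><td>Oslo</td></tr>
  <tr><td>Raj</td><td>45</td><td>Lima</td></tr>
</table>
"""


def column_by_index(table, index):
    """Return the text of cell `index` (0-based) from every row that has one."""
    values = []
    for row in table.find_all("tr"):
        # Only direct children of the row, so header and data cells both count.
        cells = row.find_all(["td", "th"], recursive=False)
        if index < len(cells):
            values.append(cells[index].get_text(strip=True))
    return values


def column_by_name(table, name):
    """Find the column whose first-row text equals `name`; skip the header cell."""
    first_row = table.find("tr")
    headers = [
        cell.get_text(strip=True)
        for cell in first_row.find_all(["td", "th"], recursive=False)
    ]
    # headers.index raises ValueError if no column has that name.
    return column_by_index(table, headers.index(name))[1:]


table = BeautifulSoup(HTML, "html.parser").table
print(column_by_index(table, 2))
print(column_by_name(table, "City"))
```

column_by_index includes the header cell because it treats every row alike, while column_by_name drops it, since the name has already been used to locate the column.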