Extracting selected columns from a table using BeautifulSoup

Question

I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column uses <th> tags and the other column of interest uses <td> tags. So far, all I've been able to get is a list of the column's cells with the tags still attached, but I just want the text.

table is already a list, so I can't call findAll(text=True) on it. I'm not sure how to get the values of the first column in another form.

from BeautifulSoup import BeautifulSoup
from sys import argv
import re

filename = argv[1] #get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.th.findAll('th') #The relevant table is the first one

print table

jonhkr · Accepted Answer

You can try this code:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column

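As you can see, the code connects to the URL and downloads the HTML. BeautifulSoup then finds the first table and iterates over its 'tr' rows. For each row it takes the first 'th' as the first column and row.findAll('td')[2], the third 'td' cell, as the other column of interest. Note that .contents returns a list of each cell's children, which is why the printed values appear inside brackets.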
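Since the question asks for plain text rather than tags, here is a minimal sketch of the same row-by-row idea using the newer bs4 package (an assumption on my part; the code above uses BeautifulSoup 3 on Python 2). The sample HTML and the helper name first_and_third are made up for illustration. get_text(strip=True) returns a cell's text as a string instead of the list that .contents produces.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one <th> followed by <td> cells per row.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Alabama</th><td>1.0</td><td>2.0</td><td>3.0</td></tr>
    <tr><th>Alaska</th><td>4.0</td><td>5.0</td><td>6.0</td></tr>
  </tbody>
</table>
"""


def first_and_third(html):
    """Return (row header text, third <td> text) for each row of the first table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    pairs = []
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        # Skip rows that lack a <th> or have fewer than three <td> cells.
        if header is None or len(cells) < 3:
            continue
        pairs.append((header.get_text(strip=True), cells[2].get_text(strip=True)))
    return pairs


print(first_and_third(SAMPLE_HTML))
```

The length check is there because header or spacer rows often have a different number of cells, and indexing them directly would raise an IndexError.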

mac389 · Answer

In addition to @jonhkr's answer I thought I'd post an alternate solution I came up with.

#!/usr/bin/python

from BeautifulSoup import BeautifulSoup
from sys import argv

filename = argv[1]
# get HTML file as a string
html_doc = ''.join(open(filename,'r').readlines())
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody

data = map(lambda x: (x.findAll(text=True)[1],x.findAll(text=True)[5]),table.findAll('tr'))
print data

Unlike jonhkr's answer, which fetches the webpage over the network, mine assumes that you have it saved on your computer and pass the filename as a command-line argument. For example:

python file.py table.html
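One caveat about picking text nodes by position, sketched below with the newer bs4 package (an assumption; my code above uses BeautifulSoup 3): whitespace between tags counts as a text node too, so which indices hold the values depends on how the HTML is laid out. The sample row here is made up for illustration. stripped_strings skips whitespace-only nodes, which makes the positions easier to reason about.

```python
from bs4 import BeautifulSoup

# Hypothetical row with line breaks between cells, as in many saved pages.
ROW_HTML = (
    "<table><tr>\n"
    "<th>Alabama</th>\n"
    "<td>1.0</td>\n"
    "<td>2.0</td>\n"
    "<td>3.0</td>\n"
    "</tr></table>"
)

row = BeautifulSoup(ROW_HTML, "html.parser").find("tr")

# Every text node in the row, including the "\n" between tags.
all_text = list(row.strings)

# Only the non-blank text, with surrounding whitespace removed.
clean_text = list(row.stripped_strings)

print(len(all_text), clean_text)
```

If the saved file is formatted differently from this sample, the raw positions shift, so it is worth printing the text nodes of one row before hard-coding indices.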
