I am trying to extract the first and third columns of this data table using BeautifulSoup. Looking at the HTML, the first column is in a <th> tag and the other column of interest is in a <td> tag. In any case, all I've been able to get out is a list of the column cells with their tags still attached, but I just want the text. table is already a list, so I can't call findAll(text=True) on it. I'm not sure how to get the listing of the first column in any other form.
from BeautifulSoup import BeautifulSoup
from sys import argv
filename = argv[1]
# read the HTML file into a single string
html_doc = open(filename, 'r').read()
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody.findAll('th') # the relevant table is the first one
print table
You can try this code:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0].contents
    third_column = row.findAll('td')[2].contents
    print first_column, third_column
As you can see, the code just connects to the URL and gets the HTML, then BeautifulSoup finds the first table, iterates over its 'tr' rows, and from each row selects the first column, which is a 'th', and the third column, which is a 'td'.
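Since the question asked for just the text rather than tag objects, here is a minimal sketch that builds on the same loop but joins each cell's text nodes into a plain string instead of printing the .contents list (it assumes the same page and table layout as above):

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.samhsa.gov/data/NSDUH/2k10State/NSDUHsae2010/NSDUHsaeAppC2010.htm"
soup = BeautifulSoup(urllib2.urlopen(url).read())

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    first_column = row.findAll('th')[0]
    third_column = row.findAll('td')[2]
    # findAll(text=True) returns the text nodes inside a tag;
    # joining and stripping them yields a plain string
    first_text = ''.join(first_column.findAll(text=True)).strip()
    third_text = ''.join(third_column.findAll(text=True)).strip()
    print first_text, third_text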
In addition to @jonhkr's answer, I thought I'd post an alternate solution I came up with.
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
from sys import argv
filename = argv[1]
# read the HTML file into a single string
html_doc = open(filename, 'r').read()
soup = BeautifulSoup(html_doc)
table = soup.findAll('table')[0].tbody
# for this page, text nodes 1 and 5 of each row hold the first and third columns' text
data = map(lambda x: (x.findAll(text=True)[1], x.findAll(text=True)[5]), table.findAll('tr'))
print data
Unlike jonhkr's answer, which fetches the page from the web, mine assumes that you have it saved on your computer and pass the filename as a command-line argument. For example:
python file.py table.html
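If counting text nodes by fixed index feels brittle, here is a sketch of a variant that walks each row's th and td tags directly instead; it assumes the same layout as the question (one th followed by td cells) and simply skips rows that don't match:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
from sys import argv

html_doc = open(argv[1], 'r').read()
soup = BeautifulSoup(html_doc)

data = []
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    ths = row.findAll('th')
    tds = row.findAll('td')
    # keep only rows that actually have a header cell and at least three td cells
    if ths and len(tds) >= 3:
        first = ''.join(ths[0].findAll(text=True)).strip()
        third = ''.join(tds[2].findAll(text=True)).strip()
        data.append((first, third))
print data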