
BeautifulSoup / Python - Convert HTML table to CSV and get href for one column

I am grabbing an HTML table with this code:

import csv
import urllib2
from bs4 import BeautifulSoup

with open('listing.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in range(39):
        url = "file:///C:/projects/HTML/Export.htm".format(i)
        u = urllib2.urlopen(url)
        try:
            html = u.read()
        finally:
            u.close()
        soup=BeautifulSoup(html)
        for tr in soup.find_all('tr')[2:]:
            tds = tr.find_all('td')
            row = [elem.text.encode('utf-8') for elem in tds]
            writer.writerow(row)

Everything works, but I also want the href URL of the link in column 9. The code currently gives me the link text, not the URL.

Also, my HTML contains two tables. Is there any way to skip the first table and build the CSV file from the second one only?

Any help is very welcome. I am new to Python and need this for a project in which I am automating a daily conversion.

Many thanks!

RobertB asked Jan 15 '15 00:01


2 Answers

You should access the href attribute of the a tag within the 9th td tag (index 8):

import csv
import urllib2
from bs4 import BeautifulSoup

records = []

def my_parse(html):
    # Defined before the loop so that it exists when the loop calls it.
    soup = BeautifulSoup(html)
    table2 = soup.find_all('table')[1]  # second table on the page
    for tr in table2.find_all('tr')[2:]:  # same row slice as in your loop
        tds = tr.find_all('td')
        url = tds[8].a.get('href')  # href of the link in column 9
        records.append([elem.text.encode('utf-8') for elem in tds])
        # perhaps you want to update one of the elements of this last
        # record with the found url now?

for index in range(39):
    url = get_url(index)  # where is the formatting in your example happening?
    response = urllib2.urlopen(url)
    try:
        html = response.read()
    except Exception:
        raise
    else:
        my_parse(html)
    finally:
        try:
            response.close()
        except (UnboundLocalError, NameError):
            raise UnboundLocalError

# It's more efficient to write only once
with open('listing.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(records)

I have taken the liberty of calling a function get_url(index), because your example rereads the same file on every iteration: the URL string contains no {} placeholder, so .format(i) does nothing. I guess that is not what you actually want. I'll leave the implementation of get_url to you. Also, I've added some better exception handling.
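One possible shape for such a helper, as a minimal sketch: the file-name pattern (Export0.htm, Export1.htm, ...) is purely an assumption for illustration, since the question does not say how the 39 files are named.

```python
# Hypothetical naming scheme: Export0.htm, Export1.htm, ...
# Adjust the template to however your export files are really named.
URL_TEMPLATE = "file:///C:/projects/HTML/Export{}.htm"

def get_url(index):
    """Return the URL of the index-th export file (assumed pattern)."""
    return URL_TEMPLATE.format(index)

# For comparison: without a {} placeholder, str.format() returns the
# string unchanged, so every iteration would open the same file.
unchanged = "file:///C:/projects/HTML/Export.htm".format(3)
```

If there is really only one file, the loop can simply be dropped.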

The code also shows how to pick the second table on the page: soup.find_all('table')[1].
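To make the two ideas concrete (selecting the second table and swapping a cell's text for its href), here is a small self-contained sketch. The inline HTML, the function name rows_with_links and the link_column parameter are made up for illustration, and it assumes Python 3 with bs4 and the built-in html.parser, so it is not a drop-in replacement for the Python 2 code above.

```python
from bs4 import BeautifulSoup

# Made-up sample page: the first table is ignored, the second one is exported.
HTML = """
<table><tr><td>layout table, ignored</td></tr></table>
<table>
  <tr><th>Name</th><th>Link</th></tr>
  <tr><td>Alpha</td><td><a href="http://example.com/a">details</a></td></tr>
  <tr><td>Beta</td><td>no link</td></tr>
</table>
"""

def rows_with_links(html, link_column):
    """Return rows of the second table, with the href in link_column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[1]        # index 1 = second table
    rows = []
    for tr in table.find_all("tr")[1:]:      # skip the single header row
        tds = tr.find_all("td")
        row = [td.get_text(strip=True) for td in tds]
        # Replace the cell text with the URL only when the cell has a link.
        if link_column < len(tds) and tds[link_column].a is not None:
            row[link_column] = tds[link_column].a.get("href", "")
        rows.append(row)
    return rows

rows = rows_with_links(HTML, link_column=1)
```

Checking for a missing a tag avoids an AttributeError on rows where the cell holds plain text instead of a link.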

Oliver W. answered Nov 17 '22 15:11


I got it fully working with the following code:

import csv
import urllib2
from bs4 import BeautifulSoup

#Grab second table from HTML
def my_parse(html):
    soup = BeautifulSoup(html)
    table2 = soup.find_all('table')[1]
    for tr in table2.find_all('tr')[2:]:
        tds = tr.find_all('td')
        url = tds[8].a.get('href')
        tds[8].a.replace_with(url)  # swap the link for its URL so elem.text yields the href
        records.append([elem.text.encode('utf-8') for elem in tds])

records = []
#Read HTML file into memory
for index in range(39):
    url = "file:///C:/projects/HTML/Export.htm".format(index)
    response = urllib2.urlopen(url)
    try:
        html = response.read()
    except Exception:
        raise
    else:
        my_parse(html)
    finally:
        try:
            response.close()
        except (UnboundLocalError, NameError):
            raise UnboundLocalError

#Writing CSV file
with open('listing.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(records)

Many thanks for all the help!!!!!
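For anyone running this on Python 3 instead: the csv module there expects a file opened in text mode with newline='', and the rows should hold str values rather than encoded bytes. A minimal sketch of just the writing step follows; the sample records and the temporary output path are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up sample rows standing in for the parsed table.
records = [
    ["Alpha", "http://example.com/a"],
    ["Zoë", "http://example.com/z"],
]

def write_records(path, rows):
    # Text mode with newline='' lets the csv module control line endings;
    # the encoding is applied by open(), so no .encode('utf-8') per cell.
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

out_path = os.path.join(tempfile.mkdtemp(), "listing.csv")
write_records(out_path, records)

# Read the file back to confirm the rows survive a round trip.
with open(out_path, newline="", encoding="utf-8") as f:
    round_trip = list(csv.reader(f))
```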

RobertB answered Nov 17 '22 16:11