
BeautifulSoup, a dictionary from an HTML table

I am trying to scrape table data from a website.

Here is a simple example table:

t = '<html><table>' +\
    '<tr><td class="label"> a </td> <td> 1 </td></tr>' +\
    '<tr><td class="label"> b </td> <td> 2 </td></tr>' +\
    '<tr><td class="label"> c </td> <td> 3 </td></tr>' +\
    '<tr><td class="label"> d </td> <td> 4 </td></tr>' +\
    '</table></html>'

The desired parse result is {' a ': ' 1 ', ' b ': ' 2 ', ' c ': ' 3 ', ' d ': ' 4 '}


This is my closest attempt so far:

s = BeautifulSoup(t)
d = {}
for tr in s.findAll('tr'):
  k, v = BeautifulSoup(str(tr)).findAll('td')
  d[str(k)] = str(v)

Result is:

{'<td class="label"> a </td>': '<td> 1 </td>', '<td class="label"> d </td>': '<td> 4 </td>', '<td class="label"> b </td>': '<td> 2 </td>', '<td class="label"> c </td>': '<td> 3 </td>'}

I'm aware of the text=True parameter of findAll() but I'm not getting the expected results when I use it.

I'm using Python 2.6 and BeautifulSoup 3.

asked Aug 10 '12 at 12:08 by jon



1 Answer

Try this:

from BeautifulSoup import BeautifulSoup

t = '<html><table>' +\
    '<tr><td class="label"> a </td> <td> 1 </td></tr>' +\
    '<tr><td class="label"> b </td> <td> 2 </td></tr>' +\
    '<tr><td class="label"> c </td> <td> 3 </td></tr>' +\
    '<tr><td class="label"> d </td> <td> 4 </td></tr>' +\
    '</table></html>'

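bs = BeautifulSoup(t)

results = {}
for row in bs.findAll('tr'):
    # Each row holds two cells: the label and its value.
    aux = row.findAll('td')
    # .string is the cell's text when the cell contains a single text node.
    results[aux[0].string] = aux[1].string

print results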
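
The code above targets Python 2 and BeautifulSoup 3, as in the question. As a minimal sketch for Python 3 with Beautiful Soup 4 (the bs4 package, which is an assumption here since the question does not use it), the same row-by-row approach might look like the following. The helper names table_to_dict and first_row_strings, and the strip flag, are made up for illustration. The second helper shows one likely reason text=True gives surprising results: the whitespace between the two cells is itself a text node, so a row yields three strings, not two.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 used above

t = (
    '<html><table>'
    '<tr><td class="label"> a </td> <td> 1 </td></tr>'
    '<tr><td class="label"> b </td> <td> 2 </td></tr>'
    '<tr><td class="label"> c </td> <td> 3 </td></tr>'
    '<tr><td class="label"> d </td> <td> 4 </td></tr>'
    '</table></html>'
)


def table_to_dict(html, strip=False):
    """Map the first cell of each two-cell row to the second (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # skip rows that are not simple label/value pairs
        # get_text() returns the cell's text; strip=True trims surrounding whitespace.
        key, value = (cell.get_text(strip=strip) for cell in cells)
        results[key] = value
    return results


def first_row_strings(html):
    """Return every text node in the first row, including inter-cell whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    return [str(s) for s in soup.find("tr").find_all(string=True)]


if __name__ == "__main__":
    print(table_to_dict(t))              # keys and values keep their spaces
    print(table_to_dict(t, strip=True))  # keys and values are trimmed
    print(first_row_strings(t))          # three strings: label, gap, value
```

Selecting the td tags first and reading their text afterwards sidesteps the stray whitespace node, which is why the loop asks for 'td' and not for text nodes directly.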

answered Sep 20 '22 at 14:09 by mvillaress