I am trying to scrape table data from a website.
Here is a simple example table:
t = '<html><table>' +\
'<tr><td class="label"> a </td> <td> 1 </td></tr>' +\
'<tr><td class="label"> b </td> <td> 2 </td></tr>' +\
'<tr><td class="label"> c </td> <td> 3 </td></tr>' +\
'<tr><td class="label"> d </td> <td> 4 </td></tr>' +\
'</table></html>'
Desired parse result is {' a ': ' 1 ', ' b ': ' 2 ', ' c ': ' 3 ', ' d ' : ' 4' }
This is my closest attempt so far:
for tr in s.findAll('tr'):
k, v = BeautifulSoup(str(tr)).findAll('td')
d[str(k)] = str(v)
Result is:
{'<td class="label"> a </td>': '<td> 1 </td>', '<td class="label"> d </td>': '<td> 4 </td>', '<td class="label"> b </td>': '<td> 2 </td>', '<td class="label"> c </td>': '<td> 3 </td>'}
I'm aware of the text=True
parameter of findAll()
but I'm not getting the expected results when I use it.
I'm using python 2.6 and BeautifulSoup3.
The following code opens an MHTML file, walks through all the parts in the file, uses BeautifulSoup4 to parse parts that have content type text/html , iterates through all the tables in the body, parses each table using html_table_extractor, and prints it out.
BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2.
The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string: Python3.
Try this:
from BeautifulSoup import BeautifulSoup, Comment
t = '<html><table>' +\
'<tr><td class="label"> a </td> <td> 1 </td></tr>' +\
'<tr><td class="label"> b </td> <td> 2 </td></tr>' +\
'<tr><td class="label"> c </td> <td> 3 </td></tr>' +\
'<tr><td class="label"> d </td> <td> 4 </td></tr>' +\
'</table></html>'
bs = BeautifulSoup(t)
results = {}
for row in bs.findAll('tr'):
aux = row.findAll('td')
results[aux[0].string] = aux[1].string
print results
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With