I would like to iterate through all the tags in a certain section of an HTML page. I am using BeautifulSoup, but I could do without it and use only the Selenium library. Let's say I have the following HTML code:
<table id="myBSTable">
<tr>
<th>Column A1</th>
<th>Column B1</th>
<th>Column C1</th>
<th>Column D1</th>
<th>Column E1</th>
</tr>
<tr>
<td data="First Column Data"></td>
<td data="Second Column Data"></td>
<td title="Title of the First Row">Value of Row 1</td>
<td>Beautiful 1</td>
<td>Soup 1</td>
</tr>
<tr>
<td></td>
<td data-g="Second Column Data"></td>
<td title="Title of the Second Row">Value of Row 2</td>
<td>Selenium 1</td>
<td>Rocks 1</td>
</tr>
<tr>
<td></td>
<td></td>
<td title="Title of the Third Row">Value of Row 3</td>
<td>Pyhon 1</td>
<td>Boulder 1</td>
</tr>
<tr>
<th>Column A2</th>
<th>Column B2</th>
<th>Column C2</th>
<th>Column D2</th>
<th>Column E2</th>
</tr>
<tr>
<td data="First Column Data"></td>
<td data="Second Column Data"></td>
<td title="Title of the First Row">Value of Row 1</td>
<td>Beautiful 2</td>
<td>Soup 2</td>
</tr>
<tr>
<td></td>
<td data-g="Second Column Data"></td>
<td title="Title of the Second Row">Value of Row 2</td>
<td>Selenium 2</td>
<td>Rocks 2</td>
</tr>
<tr>
<td></td>
<td></td>
<td title="Title of the Third Row">Value of Row 3 2</td>
<td>Pyhon 2</td>
<td>Boulder 2</td>
</tr>
</table>
I have this part working perfectly:
# Selenium libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
# BeautifulSoup
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://urltoget.com')
table = browser.find_element_by_id('myBSTable')
bs_table = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
# So far so good
rows = bs_table.findAll('tr')
for tr in rows:
    # Here is where I need help
    # I want to iterate through all tags
    # but I don't know if it is going to be a th or a td
    # At the same time I need to do something
    # depending on whether it is a td or a th
And this is what I want to accomplish:
# The following is pseudocode
for col in tr.tags:
    print col.name, col.value
    for attribute in col.attrs:
        print "  ", attribute.name, attribute.value
# End of pseudocode
Thanks, Arty
A new tag can be created by calling BeautifulSoup's built-in new_tag() method. To insert it, use the append() method: the new tag is appended to the end of the parent tag's contents.
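As a minimal sketch of new_tag() and append(), the snippet below adds one extra cell to a row. The one-row table and the title value "Added cell" are made up for illustration and are not part of the question's page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table, used only for illustration.
soup = BeautifulSoup("<table><tr><td>Beautiful 1</td></tr></table>", "html.parser")
row = soup.find("tr")

# new_tag() creates a detached tag; keyword arguments become attributes.
new_cell = soup.new_tag("td", title="Added cell")
new_cell.string = "Soup 1"

# append() puts the new tag at the end of the parent's children.
row.append(new_cell)

row_html = str(row)
print(row_html)
```

The new cell only exists in the parsed tree; the page loaded in the browser is not changed.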
Step 1: To scrape a page, import the BeautifulSoup module and, to fetch the page, the requests module. Step 2: Request the URL by calling the get() method.
To find multiple tags, you can use the comma (,) CSS selector, which lets you specify several tags separated by commas. To use a CSS selector, call the .select_one() method instead of .find(), or .select() instead of .find_all().
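A short sketch of the comma selector, assuming a small made-up table (the id "demo" and the cell values are illustrative only). It uses html.parser so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, used only for illustration.
html = """
<table id="demo">
  <tr><th>Column A</th><th>Column B</th></tr>
  <tr><td title="First">Value 1</td><td>Value 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "td, th" matches either tag; matches come back in document order.
cells = soup.select("td, th")
cell_names = [cell.name for cell in cells]
print(cell_names)

# select_one() returns only the first match (or None if nothing matches).
first_cell = soup.select_one("td, th")
print(first_cell.get_text())
```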
You may locate either td or th by specifying a list of tags to look for. In order to get all element attributes, use the .attrs attribute:
rows = bs_table.find_all('tr')
for row in rows:
    cells = row.find_all(['td', 'th'])
    for cell in cells:
        print(cell.name, cell.attrs)
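To get closer to the pseudocode in the question, the same loop can also print each cell's text and walk its attributes one by one. The following is only a sketch: describe_cells is a hypothetical helper name, and the table is a shortened copy of the one in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table, for illustration.
html = """
<table id="myBSTable">
<tr><th>Column A1</th><th>Column B1</th></tr>
<tr><td data="First Column Data"></td><td title="Title of the First Row">Value of Row 1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def describe_cells(table_soup):
    """Return (tag name, text, attributes dict) for every td/th, row by row."""
    described = []
    for row in table_soup.find_all("tr"):
        for cell in row.find_all(["td", "th"]):
            described.append((cell.name, cell.get_text(strip=True), dict(cell.attrs)))
    return described


cells = describe_cells(soup)
for name, text, attrs in cells:
    print(name, text)
    # attrs is a plain dict, so items() gives each attribute name and value.
    for attr_name, attr_value in attrs.items():
        print("  ", attr_name, attr_value)
```

Branching on name == 'td' or name == 'th' inside the loop is then enough to treat header and data cells differently.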
Alternative looping (action is at the bottom):
html='''<table id="myBSTable">
<tr>
<th>Column A1</th>
<th>Column B1</th>
<th>Column C1</th>
<th>Column D1</th>
<th>Column E1</th>
</tr>
<tr>
<td data="First Column Data"></td>
<td data="Second Column Data"></td>
<td title="Title of the First Row">Value of Row 1</td>
<td>Beautiful 1</td>
<td>Soup 1</td>
</tr>
<tr>
<td></td>
<td data-g="Second Column Data"></td>
<td title="Title of the Second Row">Value of Row 2</td>
<td>Selenium 1</td>
<td>Rocks 1</td>
</tr>
<tr>
<td></td>
<td></td>
<td title="Title of the Third Row">Value of Row 3</td>
<td>Pyhon 1</td>
<td>Boulder 1</td>
</tr>
<tr>
<th>Column A2</th>
<th>Column B2</th>
<th>Column C2</th>
<th>Column D2</th>
<th>Column E2</th>
</tr>
<tr>
<td data="First Column Data"></td>
<td data="Second Column Data"></td>
<td title="Title of the First Row">Value of Row 1</td>
<td>Beautiful 2</td>
<td>Soup 2</td>
</tr>
<tr>
<td></td>
<td data-g="Second Column Data"></td>
<td title="Title of the Second Row">Value of Row 2</td>
<td>Selenium 2</td>
<td>Rocks 2</td>
</tr>
<tr>
<td></td>
<td></td>
<td title="Title of the Third Row">Value of Row 3 2</td>
<td>Pyhon 2</td>
<td>Boulder 2</td>
</tr>
</table>'''
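from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')
for tr in rows:
    # .children also yields whitespace text nodes; their .name is None,
    # so they fall through both checks below.
    for z in tr.children:
        if z.name == 'td':
            pass  # do stuff1
        elif z.name == 'th':
            pass  # do stuff2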
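One thing to keep in mind with .children is that it also returns the text nodes (line breaks) between the cells. The sketch below, on a made-up one-row table, shows two ways to keep only the tags: an isinstance check against Tag, or find_all(True, recursive=False), which returns direct child tags only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-row table with line breaks between the cells.
html = """<table>
<tr>
<th>Column A1</th>
<td title="Title of the First Row">Value of Row 1</td>
</tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# .children yields every direct child, including the "\n" strings.
all_children = list(row.children)

# Option 1: keep only real tags.
tag_children = [child for child in all_children if isinstance(child, Tag)]

# Option 2: find_all(True, recursive=False) returns direct child tags only.
direct_tags = row.find_all(True, recursive=False)

print(len(all_children), [t.name for t in tag_children], [t.name for t in direct_tags])
```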