Python BeautifulSoup, iterating through tags and attributes

Tags: python, html, beautifulsoup, selenium

I would like to iterate through all the tags in a certain section of an HTML page. I am using BeautifulSoup, but I could live without it and use only the Selenium library. Let's say I have the following HTML:

<table id="myBSTable">   
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
        <th>Column C1</th>
        <th>Column D1</th>
        <th>Column E1</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 1</td>
        <td>Soup 1</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 1</td>
        <td>Rocks 1</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3</td>
        <td>Pyhon 1</td>
        <td>Boulder 1</td>
    </tr>
    <tr>
        <th>Column A2</th>
        <th>Column B2</th>
        <th>Column C2</th>
        <th>Column D2</th>
        <th>Column E2</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 2</td>
        <td>Soup 2</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 2</td>
        <td>Rocks 2</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3 2</td>
        <td>Pyhon 2</td>
        <td>Boulder 2</td>
    </tr>
</table>

I have this part working perfectly:

#Selenium libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

#BeautifulSoup
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://urltoget.com')   

table = browser.find_element_by_id('myBSTable')
bs_table = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml')
#So far so good
rows = bs_table.findAll('tr')
for tr in rows:
    #Here is where I need help
    #I want to iterate through all tags,
    #but I don't know whether each one is going to be a th or a td.
    #At the same time I need to do something
    #depending on whether it is a td or a th

And this is what I want to accomplish:

    #The following is pseudocode
    for col in tr.tags:
        print col.name, col.value
        for attribute in col.attrs:
            print "    ", attribute.name, attribute.value
    #End pseudo code

Thanks, Arty

asked Jun 23 '17 14:06

Arty

2 Answers

You can match both td and th cells by passing a list of tag names to find_all(). To get all of an element's attributes, use its .attrs attribute, which is a dictionary mapping attribute names to values:

rows = bs_table.find_all('tr')
for row in rows:
    cells = row.find_all(['td', 'th'])
    for cell in cells:
        print(cell.name, cell.attrs)
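
To get closer to the pseudocode in the question, the same loop can also print each cell's text and walk its attributes one by one. The following is a minimal sketch, assuming Python 3 and the built-in html.parser; the shortened sample_html and the describe_cells helper are illustrative names that do not appear in the original post.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same shape).
sample_html = """
<table id="myBSTable">
    <tr><th>Column A1</th><th>Column B1</th></tr>
    <tr><td data="First Column Data"></td><td title="Title of the First Row">Value of Row 1</td></tr>
</table>
"""


def describe_cells(html):
    """Return (tag name, text, attribute dict) for every td/th, row by row."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for row in soup.find_all("tr"):
        # A list of names matches any of them, in document order.
        for cell in row.find_all(["td", "th"]):
            result.append((cell.name, cell.get_text(strip=True), dict(cell.attrs)))
    return result


for name, text, attrs in describe_cells(sample_html):
    print(name, text)
    # .attrs behaves like a dict, so .items() yields (name, value) pairs.
    for attr_name, attr_value in attrs.items():
        print("    ", attr_name, attr_value)
```

Returning plain tuples keeps the traversal separate from the printing, so the same helper could feed whatever per-cell action is needed instead of print.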

answered Oct 12 '22 02:10

alecxe

An alternative is to loop over each row's children and check the tag name (the relevant code is at the bottom):

html='''<table id="myBSTable">   
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
        <th>Column C1</th>
        <th>Column D1</th>
        <th>Column E1</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 1</td>
        <td>Soup 1</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 1</td>
        <td>Rocks 1</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3</td>
        <td>Pyhon 1</td>
        <td>Boulder 1</td>
    </tr>
    <tr>
        <th>Column A2</th>
        <th>Column B2</th>
        <th>Column C2</th>
        <th>Column D2</th>
        <th>Column E2</th>
    </tr>
    <tr>
        <td data="First Column Data"></td>
        <td data="Second Column Data"></td>
        <td title="Title of the First Row">Value of Row 1</td>
        <td>Beautiful 2</td>
        <td>Soup 2</td>
    </tr>
    <tr>
        <td></td>
        <td data-g="Second Column Data"></td>
        <td title="Title of the Second Row">Value of Row 2</td>
        <td>Selenium 2</td>
        <td>Rocks 2</td>
    </tr>
    <tr>
        <td></td>
        <td></td>
        <td title="Title of the Third Row">Value of Row 3 2</td>
        <td>Pyhon 2</td>
        <td>Boulder 2</td>
    </tr>
</table>'''

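from bs4 import BeautifulSoup

# Name the parser explicitly to avoid the "no parser was specified" warning
soup = BeautifulSoup(html, 'html.parser')

rows = soup.find_all('tr')
for tr in rows:
    # .children also yields the whitespace strings between cells;
    # their .name is None, so they match neither branch below
    for z in tr.children:
        if z.name == 'td':
            pass  # do stuff1: handle a data cell
        elif z.name == 'th':
            pass  # do stuff2: handle a header cell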
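
Because .children also yields the whitespace strings between cells, a variant that asks only for direct child tags can be easier to reason about. This is a sketch under stated assumptions rather than part of the original answer: handle_th, handle_td and the small two-row sample are hypothetical names for illustration, and find_all(True, recursive=False) is swapped in for .children.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample standing in for the full table above.
sample_html = """
<table>
    <tr>
        <th>Column A1</th>
        <th>Column B1</th>
    </tr>
    <tr>
        <td data-g="Second Column Data"></td>
        <td>Selenium 1</td>
    </tr>
</table>
"""


def handle_th(cell):
    # Placeholder action for header cells (assumption: we just label them).
    return "header: " + cell.get_text(strip=True)


def handle_td(cell):
    # Placeholder action for data cells.
    return "data: " + cell.get_text(strip=True)


# Map each tag name to the action that should run for it.
handlers = {"th": handle_th, "td": handle_td}

soup = BeautifulSoup(sample_html, "html.parser")
results = []
for tr in soup.find_all("tr"):
    # True matches every tag; recursive=False restricts the search to
    # direct children, so the text nodes between cells are never returned.
    for child in tr.find_all(True, recursive=False):
        handler = handlers.get(child.name)
        if handler is not None:
            results.append(handler(child))

print(results)
```

The dictionary lookup replaces the chain of if statements, which may be convenient if more tag types need their own action later.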

answered Oct 12 '22 01:10

Dmitriy Fialkovskiy
