Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Beautiful Soup parse a table with a specific ID

I'm trying to get the data from a table with a specific ID which I know. For some reason, the code keeps giving me a None result.

From the HTML code I'm trying to parse:

<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
    <tr class="gridHeader" valign="top">
        <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td>
        <td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td>
        <td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td>
        <td class="titleGridReg" align="center" valign="top">שער בסיס</td>
        <td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span></td>
        <td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
    </tr>
    <tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">

... And so on

My code:

html = br.response().read()
soup = BeautifulSoup(html)

table = soup.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
rows = table.findAll(lambda tag: tag.name=='tr')

In [100]: print table
None
like image 897
erantdo Avatar asked Oct 25 '13 13:10

erantdo


People also ask

How do I find a specific element with BeautifulSoup?

BeautifulSoup has a limited support for CSS selectors, but covers most commonly used ones. Use select() method to find multiple elements and select_one() to find a single element.

What is the difference between Find_all () and find () in BeautifulSoup?

find is used for returning the result when the searched element is found on the page. find_all is used for returning all the matches after scanning the entire document. It is used for getting merely the first tag of the incoming HTML object for which condition is satisfied.

What is the function you would use in BeautifulSoup to find all the TR tags?

We could just use find_all() again to find all the tr tags, yes, but we can also to iterate over these tags in a more straight forward manner. The children attribute returns an iterable object with all the tags right beneath the parent tag, which is table , therefore it returns all the tr tags.


2 Answers

From the documentation:

table = soup.find('table', id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")

And the for the rows line:

rows = table.findAll('tr')

For the encoding problem, try decoding it from utf-8, and re-encode it.

html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html.encode('utf-8'))
like image 143
aIKid Avatar answered Nov 12 '22 02:11

aIKid


Improving upon aiKid's answer:

# coding=utf-8
from bs4 import BeautifulSoup

html = u"""
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
                            <tr class="gridHeader" valign="top">
                                <td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
                            </tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
"""

soup = BeautifulSoup(html)
print soup.find_all("table",
                    id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")

Since you're working with UTF-8 data, you need to set the string as a unicode string like so u"""(...)""". All you need to do to work with unicode is this:

br.response().read().decode('utf-8')

The above will give you an ASCII string, that you can later encode into unicode. Like, say the string is stored in html, and you can encode it back to unicode using html.encode("utf-8"). If you do this, you do not need to put the u in front of anything. You can treat everything as a regular string again.

like image 30
Games Brainiac Avatar answered Nov 12 '22 01:11

Games Brainiac