HTML table to pandas table: Info inside html tags

Tags:

python

pandas

beautifulsoup

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl) the output is like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.

Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.

Is there a simple way of accessing this information?

734

asked Aug 02 '15 11:08

iayork

4 Answers

Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.

Using lxml, you could extract the attribute values with XPath:

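from io import StringIO

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
# recent pandas versions expect a file-like object rather than a literal HTML string
for df in pd.read_html(StringIO(content)):
    # one href per row; this assumes every row contains exactly one link
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep the part between the last "/" and the "." (e.g. jones03)
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
    print(df)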

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01

The above may be useful since it requires only a few extra lines of code to add the refname column.

But both LH.fromstring and pd.read_html parse the HTML, so its efficiency could be improved by dropping pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')] 
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01

172

answered Oct 05 '22 09:10

unutbu

You could simply parse the table manually like this:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    record = []
    record.append(tds[0].text)
    record.append(tds[1].a["href"])  # assumes the second cell always holds a link
    record.append(tds[2].text)
    records.append(record)

df = pd.DataFrame(data=records)
df

which gives you

     0                 1   2
0  265  /j/jones03.shtml  29
1  266  /s/smith01.shtml  34
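
The loop above assumes the second cell of every row holds a link. As a rough sketch of the same manual approach that tolerates cells without one, the standard library's html.parser can be used in place of BeautifulSoup (a swapped-in technique, not what the answer uses). The class name LinkTableParser and the second sample row without a link are made up for illustration; header cells and nested tables are deliberately ignored.

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect each <td> as a (text, href) pair; href is None without a link."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the <tr> currently being read
        self._text = None   # text fragments of the <td> currently being read
        self._href = None   # first href seen inside the current <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only keep text that sits inside a <td>
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            cell = ("".join(self._text).strip(), self._href)
            if self._row is not None:
                self._row.append(cell)
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample: the second row has no link in its name cell
html = """<table><tbody>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td>Smith</td><td>34</td></tr>
</tbody></table>"""

parser = LinkTableParser()
parser.feed(html)
parser.close()
for row in parser.rows:
    print(row)
```

Passing parser.rows to pd.DataFrame should then give one column per cell, with each value a (text, href) pair that can be split up afterwards.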

answered Oct 05 '22 08:10

k-nut

You could use regular expressions to modify the text first and remove the html tags:

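import re
from io import StringIO

import pandas as pd

tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
# replace each <a href="...">text</a> with "href text" so the link target survives read_html
tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
# recent pandas versions expect a file-like object rather than a literal HTML string
pd.read_html(StringIO(tbl))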

which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]
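
The question mentions that an output like "jones03 Jones" would be even more helpful. A possible variant of the substitution, sketched here under the assumption that the id is always the last path segment of the href with its extension removed, captures only that part. The name LINK and the exact pattern are illustrative, and regular expressions remain fragile on HTML that deviates from this shape (single-quoted attributes, nested tags inside the link, and so on).

```python
import re

tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

# Group 1: last path segment of the href, up to its first "." (the id).
# Group 2: the link text.
LINK = re.compile(
    r'<a\b[^>]*\bhref="(?:[^"]*/)?([^"/.]+)[^"]*"[^>]*>(.*?)</a>',
    flags=re.IGNORECASE | re.DOTALL,
)

# Replace every link with "<id> <link text>"
cleaned = LINK.sub(r"\1 \2", tbl)
print(cleaned)
```

The cleaned string can be handed to pd.read_html in the same way as before; the second column should then hold values such as "jones03 JonesBlue".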

answered Oct 05 '22 09:10

freeseek

This is available now in pandas 1.5.0+ using the extract_links parameter of pd.read_html.

extract_links - possible options: {None, "all", "header", "body", "footer"}

Table elements in the specified section(s) with <a> tags will have their href extracted.

Documentation

Example

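from io import StringIO

import pandas as pd

html_table = """
<table>
<tr>
  <th>GitHub</th>
</tr>
<tr>
  <td><a href="https://github.com/pandas-dev/pandas">pandas</a></td>
</tr>
</table>
"""

# extract_links="all" extracts hrefs from header, body and footer cells
df = pd.read_html(
  StringIO(html_table),
  extract_links="all"
)[0]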
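
With extract_links, cells in the selected section should come back as (text, href) tuples, with href set to None where the cell has no link. To get from there to the unique id the question asks for, a small helper along these lines might be used. It is only a sketch: the name split_link_cell is made up, the sample row is hypothetical data shaped like the asker's table, and the id is assumed to be the file name in the link without its extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def split_link_cell(cell):
    """Return (text, uid) for one table cell.

    Assumes the cell is a (text, href) tuple as produced with extract_links;
    anything else is passed through with uid set to None.
    """
    if not isinstance(cell, tuple):
        return cell, None
    text, href = cell
    if not href:
        return text, None
    # Drop scheme, host and query string, then take the file name without extension
    uid = PurePosixPath(urlsplit(href).path).stem
    return text, uid


# Hypothetical row, shaped like the asker's table read with extract_links="body"
row = [("265", None), ("JonesBlue", "/j/jones03.shtml"), ("29", None)]
print([split_link_cell(cell) for cell in row])
```

On a DataFrame read this way, something like df[1].map(split_link_cell) would apply the helper to a whole column, after which the text and id can be placed in separate columns.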

answered Oct 05 '22 10:10

Gabe
