 

HTML table to pandas table: Info inside html tags

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:

<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>

When I convert this to pandas using pd.read_html(tbl) the output is like this:

    0    1          2
 0  265  JonesBlue  29
 1  266  Smith      34

I need to keep the information in the <A HREF ... > tag, since the unique identifier is stored in the link. That is, the table should look like this:

    0    1        2
 0  265  jones03  29
 1  266  smith01  34

I'm fine with various other outputs (for example, jones03 Jones would be even more helpful) but the unique ID is critical.

Other cells also have html tags in them, and in general I don't want those to be saved, but if that's the only way of getting the uid I'm OK with keeping those tags and cleaning them up later, if I have to.

Is there a simple way of accessing this information?

Asked Aug 02 '15 by iayork



4 Answers

Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.

Using lxml, you could extract the attribute values with XPath:

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
for df in pd.read_html(content):
    df['refname'] = table.xpath('//tr/td/a/@href')
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
    print(df)

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01

The above may be useful since it requires only a few extra lines of code to add the refname column.

But both LH.fromstring and pd.read_html parse the HTML, so its efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')] 
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
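One caveat worth noting: the XPath //tr/td/a/@href returns one value per link found anywhere in the table, so it silently assumes every row contains exactly one link; a row with no link would shift the whole refname column out of alignment. A minimal sketch of a per-row variant that guards against this (the None fallback and the sample row without a link are my additions, not part of the original answer):

```python
import lxml.html as LH

content = """<table><tbody>
<tr><td>265</td><td><a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>267</td><td>NoLink</td><td>31</td></tr>
</tbody></table>"""

table = LH.fromstring(content)
# query each row separately: one href (or None) per row,
# so rows without links stay aligned with their data
hrefs = [(tr.xpath('td/a/@href') or [None])[0]
         for tr in table.xpath('//tr')]
print(hrefs)  # ['/j/jones03.shtml', None]
```

The resulting list can then be assigned to df['refname'] exactly as above.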
Answered Oct 05 '22 by unutbu


You could simply parse the table manually like this:

from bs4 import BeautifulSoup
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    # text from the first and third cells, the href from the link in the second
    records.append([tds[0].text, tds[1].a["href"], tds[2].text])

df = pd.DataFrame(data=records)
df

which gives you

     0                 1   2
0  265  /j/jones03.shtml  29
1  266  /s/smith01.shtml  34
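To reduce those hrefs to the bare identifier the question asks for (jones03 rather than /j/jones03.shtml), one more line with str.extract does it. A small sketch (the regex and column handling are my additions, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({1: ["/j/jones03.shtml", "/s/smith01.shtml"]})
# keep only the filename stem between the last "/" and ".shtml"
df[1] = df[1].str.extract(r'([^/]+)\.shtml', expand=False)
print(df[1].tolist())  # ['jones03', 'smith01']
```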
Answered Oct 05 '22 by k-nut


You could use regular expressions to modify the text first and remove the html tags:

import re, pandas as pd
tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
tbl = re.sub('<a.*?href="(.*?)">(.*?)</a>', '\\1 \\2', tbl)
pd.read_html(tbl)

which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]
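From there, the combined column can be split into a uid and a name in one step. A sketch of how that might look (the two-group regex and the column names are my additions):

```python
import pandas as pd

df = pd.DataFrame({1: ["/j/jones03.shtml JonesBlue", "/s/smith01.shtml Smith"]})
# first group: the href stem (the uid); second group: the remaining text
parts = df[1].str.extract(r'/([^/]+)\.shtml (.+)')
parts.columns = ['uid', 'name']
print(parts)
```

Note that regex-rewriting HTML works here because the markup is simple and regular; for messier pages the DOM-based answers above are safer.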
Answered Oct 05 '22 by freeseek


This is available now in pandas 1.5.0+ using the extract_links parameter of pd.read_html.

extract_links accepts one of {None, "all", "header", "body", "footer"}: table elements in the specified section(s) that contain <a> tags will have their href extracted. See the pd.read_html documentation. For example:

    html_table = """
    <table>
    <tr>
      <th>GitHub</th>
    </tr>
    <tr>
      <td><a href="https://github.com/pandas-dev/pandas">pandas</a> 
    </td>
    </tr>
    </table>
    """
    
    
    df = pd.read_html(
      html_table,
      extract_links="all"
    )[0]
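Applied to the table from the question, each body cell comes back as a (text, link) tuple, so the link column still needs unpacking. A sketch, assuming pandas >= 1.5 (the StringIO wrapper, the flattening steps, and the regex are my additions, not part of the original answer):

```python
from io import StringIO

import pandas as pd

tbl = """<table>
<tr><td>265</td><td><a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td><a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</table>"""

# pandas >= 1.5: with extract_links="body" every body cell
# becomes a (text, href-or-None) tuple; StringIO avoids the
# literal-string deprecation in newer pandas
df = pd.read_html(StringIO(tbl), extract_links="body")[0]

# keep the text part of columns 0 and 2, the href stem of column 1
df[0] = df[0].str[0]
df[2] = df[2].str[0]
df[1] = df[1].str[1].str.extract(r'([^/]+)\.shtml', expand=False)
print(df)
```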
    
Answered Oct 05 '22 by Gabe