
How to convert an HTML table into a pandas DataFrame

pandas provides a useful to_html() method that converts a DataFrame into an HTML table. Is there a function that reads the table back into a DataFrame?

asked Apr 15 '13 07:04 by waitingkuo


People also ask

Can pandas read an HTML table?

pandas can read the tables in an HTML document with the read_html() function, which returns them as a list of DataFrames. It accepts a file or a URL.

How do you extract HTML table data in Python?

To extract a table from HTML, you first need to open your developer tools to see how the HTML looks and verify if it really is a table and not some other element. You open developer tools with the F12 key, see the “Elements” tab, and highlight the element you're interested in.


2 Answers

Use the read_html utility, released in pandas 0.12.
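A minimal round-trip sketch, assuming a reasonably recent pandas and an installed HTML parser backend (read_html needs lxml, or BeautifulSoup4 with html5lib). The sample data and the variable names are illustrative only. Note that to_html() writes floats at display precision, so long decimals come back rounded.

```python
from io import StringIO

import numpy as np
import pandas as pd

# Illustrative data; any DataFrame with a simple index should behave the same way.
df = pd.DataFrame([[1.5, 2.25], [3.0, 4.75]], columns=["a", "b"])

html = df.to_html()

# read_html returns a list with one DataFrame per <table> found.
# index_col=0 treats the first HTML column (the index written by to_html) as the index.
tables = pd.read_html(StringIO(html), index_col=0)
restored = tables[0]

print(restored)
print(np.allclose(restored.to_numpy(), df.to_numpy()))
```

Wrapping the string in StringIO avoids the deprecation warning that newer pandas versions raise when literal HTML is passed directly.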

answered Sep 20 '22 20:09 by waitingkuo


In the general case it is not possible, but if you know the approximate structure of your table you could do something like this:

# Create a test df:
>>> import numpy as np
>>> from pandas import DataFrame
>>> df = DataFrame(np.random.rand(4, 5), columns=list('abcde'))
>>> df
     a           b           c           d           e
0    0.675006    0.230464    0.386991    0.422778    0.657711
1    0.250519    0.184570    0.470301    0.811388    0.762004
2    0.363777    0.715686    0.272506    0.124069    0.045023
3    0.657702    0.783069    0.473232    0.592722    0.855030

Now parse the HTML and reconstruct the DataFrame:

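>>> from pyquery import PyQuery as pq
>>> d = pq(df.to_html())
>>> # Column names come from the first header row; the empty index header cell drops out in split().
>>> columns = d('thead tr').eq(0).text().split()
>>> n_rows = len(d('tbody tr'))
>>> # Only <td> cells hold data; the index is rendered as <th> cells and is skipped.
>>> values = np.array(d('tbody tr td').text().split(), dtype=float).reshape(n_rows, len(columns))
>>> DataFrame(values, columns=columns)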

     a           b           c           d           e
0    0.675006    0.230464    0.386991    0.422778    0.657711
1    0.250519    0.184570    0.470301    0.811388    0.762004
2    0.363777    0.715686    0.272506    0.124069    0.045023
3    0.657702    0.783069    0.473232    0.592722    0.855030

If needed, you could extend this to handle MultiIndex DataFrames, or add automatic type detection using eval().
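As one possible take on the type-detection idea, here is a hedged sketch that swaps eval() for pd.to_numeric, which avoids executing cell contents as code. It uses the standard-library HTMLParser instead of pyquery, and the names TableCells and html_to_frame are hypothetical helpers, not part of pandas. It assumes the default to_html() layout: one header row and a single index column in the first cell of every row.

```python
from html.parser import HTMLParser

import pandas as pd


class TableCells(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def html_to_frame(html):
    """Rebuild a DataFrame from to_html() output, converting numeric columns."""
    parser = TableCells()
    parser.feed(html)
    header, *body = parser.rows
    # Assumption: the first cell of each row is the index column, so drop it.
    frame = pd.DataFrame([row[1:] for row in body], columns=header[1:])
    for col in frame.columns:
        try:
            frame[col] = pd.to_numeric(frame[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns as strings
    return frame


df = pd.DataFrame({"name": ["x", "y"], "count": [1, 2], "score": [0.5, 1.5]})
restored = html_to_frame(df.to_html())
print(restored.dtypes)
```

Columns that fail numeric conversion are left as text, so mixed tables survive the round trip; dates and MultiIndex headers would need extra handling that this sketch does not attempt.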

answered Sep 18 '22 20:09 by elyase