
Using pandas to read downloaded html file

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

What have I done wrong?

update 01

The HTML contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, like BeautifulSoup, before passing it to pandas?

lokheart asked Jul 31 '14 10:07



2 Answers

I think you are on the right track with an HTML parser like Beautiful Soup. pandas.read_html() is meant to extract HTML tables, and it can fail when the markup around the table is malformed, so it helps to isolate the table first.

You would want to do something like this...

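from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd

# Parse the saved page and pull out the first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html takes HTML text or a file-like object, not a BeautifulSoup tag,
# and returns a list of DataFrames, so convert with str() and take the first
df = pd.read_html(StringIO(str(table)))[0]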
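To try the idea without the original file, here is a minimal sketch that uses a made-up page (a script block followed by a table) in place of age0.html. The page contents, the column names, and the choice of the built-in 'html.parser' are assumptions for illustration only. It also assumes lxml, or bs4 together with html5lib, is installed so that read_html has a parser to use.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page: JavaScript on top, then a table
page = """
<html>
<head><script>var x = 1;</script></head>
<body>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>31</td></tr>
  <tr><td>Bob</td><td>45</td></tr>
</table>
</body>
</html>
"""

# Let Beautiful Soup locate the table so pandas never sees the rest of the page
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")
if table is None:
    raise ValueError("no <table> element found in the page")

# read_html returns a list of DataFrames; this fragment holds exactly one table
tables = pd.read_html(StringIO(str(table)))
df = tables[0]
print(df)
```

If the real page has several tables, soup.find_all('table') returns all of them, and each one can be passed to read_html the same way.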
ZJS answered Sep 24 '22 21:09



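  1. First, install the packages that read_html relies on for parsing:

    • pip install beautifulsoup4
    • pip install lxml
    • pip install html5lib
  2. Then use read_html to read the tables on the HTML page. It returns a list of DataFrames, one per table found, so index into the list to get the one you want.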


    import pandas as pds
    pds_df = pds.read_html('C:/age0.html')
    pds_df[0]
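Because read_html returns a list, index 0 is only right when the table you want comes first. As a rough sketch, the match parameter of read_html can narrow the result to tables containing a given text. The two-table HTML string and its column names below are made up for illustration, and the example assumes the parser packages listed above are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment with two tables
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
</table>
<table>
  <tr><th>name</th><th>age</th></tr>
  <tr><td>Ann</td><td>31</td></tr>
</table>
"""

# Without match, every table on the page comes back, in document order
all_tables = pd.read_html(StringIO(html))
print(len(all_tables))

# With match, only tables containing text that matches the pattern are kept
age_tables = pd.read_html(StringIO(html), match="age")
age_df = age_tables[0]
print(age_df)
```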
    

I hope this will help.

Good Luck!!

srana answered Sep 25 '22 21:09
