As the title says, I tried using read_html,
but it gives me the following error:
In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6
What have I done wrong?
The HTML contains some JavaScript at the top and then an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, like BeautifulSoup, before giving it to pandas?
We can read the tables of an HTML file using the read_html() function. This function reads the tables in an HTML document as pandas DataFrames and returns them as a list, one DataFrame per table. It can read from a file or a URL.
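As a rough illustration of what read_html() returns, here is a minimal sketch. The inline HTML string and its column names are made up for the example, and it assumes pandas plus a parser backend such as lxml are installed. Recent pandas versions expect literal HTML to be wrapped in StringIO rather than passed as a bare string.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the contents of a real page
html = """
<html><body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body></html>
"""

# read_html returns a list of DataFrames, one per <table> it finds
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))
print(df.columns.tolist())
```

Because the result is always a list, you index into it (tables[0] here) even when the page holds a single table.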
The Python community has come up with some pretty powerful web scraping tools. Among them, pandas read_html() is a quick and convenient way to scrape data from HTML tables. In this article, you'll learn how to use pandas read_html() to deal with common problems, which should help you get started with web scraping.
I think you are on the right track by using an HTML parser like Beautiful Soup. pandas.read_html() is meant to pull tables out of HTML, and the lxml flavor is strict about malformed markup elsewhere in the page.
You would want to do something like this...
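from io import StringIO
from bs4 import BeautifulSoup
import pandas as pd
# Parse the file leniently and keep only the first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')
# read_html needs a string or file-like object, not a BeautifulSoup tag,
# and it returns a list of DataFrames, so take the first one
df = pd.read_html(StringIO(str(table)))[0]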
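Since the file in the question has JavaScript ahead of the <html> tag, another option is to cut the table markup out before any strict parser sees the rest. The following is only a sketch using the standard library re module. The helper names extract_tables and tables_to_frames and the sample string are hypothetical, and the regular expression does not cope with nested tables.

```python
import re
from io import StringIO

# Matches each <table ...> ... </table> block; does NOT handle nested tables
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def extract_tables(html_text):
    """Return the raw markup of every <table> element found in html_text."""
    return TABLE_RE.findall(html_text)


def tables_to_frames(html_text):
    """Parse each extracted table with pandas (imported only when called)."""
    import pandas as pd

    return [pd.read_html(StringIO(t))[0] for t in extract_tables(html_text)]


# Hypothetical stand-in for a file with script content ahead of the <html> tag
sample = """<script>var x = 1;</script>
<html><body>
<table><tr><th>age</th></tr><tr><td>0</td></tr></table>
</body></html>"""

snippets = extract_tables(sample)
print(len(snippets))
```

For the real file you would read its text with open(...).read() and pass that to tables_to_frames. If the page is messier than this, the Beautiful Soup route is likely the more robust choice.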
First of all, install the packages below for parsing,
then use read_html to read the HTML tables on any HTML page.
import pandas as pds
pds_df = pds.read_html('C:/age0.html')
pds_df[0]
I hope this will help.
Good Luck!!