Using pandas to read a downloaded HTML file

Tags: python, html, import, pandas

As the title says, I tried using read_html, but it gives me the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

What have I done wrong?

update 01

The HTML file contains some JavaScript at the top, followed by an HTML table. In R, I processed it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I use something else, such as Beautiful Soup, before passing it to pandas?

asked Jul 31 '14 10:07

lokheart

2 Answers

I think you are on the right track by using an HTML parser like Beautiful Soup. pandas.read_html() is meant for extracting HTML tables, so it helps to hand it just the table markup rather than the whole page.

You would want to do something like this:

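from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Parse the whole page and keep only the first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html takes markup text (not a BeautifulSoup object) and returns a list of DataFrames
df = pd.read_html(StringIO(str(table)))[0]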
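
For anyone who wants to try this without the original file, here is a minimal, self-contained sketch of the same idea. The sample page (a script block followed by one table), the column names, and the helper name first_table_to_frame are all made up for illustration, and it assumes pandas, beautifulsoup4 and html5lib are installed. It isolates the table with Beautiful Soup and passes its markup to read_html as text:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded file: JavaScript first, then a table.
SAMPLE_PAGE = """
<html>
<head><script>var loaded = true;</script></head>
<body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body>
</html>
"""


def first_table_to_frame(html_text):
    """Return the first <table> in html_text as a DataFrame (illustrative helper)."""
    table = BeautifulSoup(html_text, "html.parser").find("table")
    if table is None:
        raise ValueError("no <table> element found")
    # read_html returns a list of DataFrames, one per table it finds.
    frames = pd.read_html(StringIO(str(table)), flavor="bs4")
    return frames[0]


df = first_table_to_frame(SAMPLE_PAGE)
print(df)
```

Wrapping the markup in StringIO avoids passing a raw HTML string directly, which recent pandas versions warn about.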

answered Sep 24 '22 21:09

ZJS

First, install the packages below, which pandas uses for parsing:
- pip install beautifulsoup4
- pip install lxml
- pip install html5lib

Then use read_html to read the HTML tables on any HTML page. It returns a list of DataFrames, one per table found:

import pandas as pds
pds_df = pds.read_html('C:/age0.html')
pds_df[0]
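
Note that the question forced flavor='lxml'. The pandas documentation describes the default (no flavor given) as trying lxml first and falling back to bs4 + html5lib. If the page still trips up the parser, it may be worth asking for the more lenient parser explicitly. The sketch below is only an illustration under assumptions: the sample page and its contents are invented, and it needs beautifulsoup4 and html5lib installed. It also shows the match argument, which keeps only the tables whose text matches a string or regex:

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second one holds the data we want.
SAMPLE_PAGE = """
<html><body>
<table><tr><th>menu</th></tr><tr><td>home</td></tr></table>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
</table>
</body></html>
"""

# flavor="bs4" asks pandas to parse with Beautiful Soup + html5lib instead of lxml.
all_tables = pd.read_html(StringIO(SAMPLE_PAGE), flavor="bs4")
print(len(all_tables))  # one DataFrame per <table> on the page

# match keeps only the tables containing text that matches the given string or regex.
age_tables = pd.read_html(StringIO(SAMPLE_PAGE), flavor="bs4", match="age")
age_df = age_tables[0]
print(age_df)
```

With a real file, the path can be passed in place of the StringIO object, as in the snippet above.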

I hope this will help.

Good Luck!!

answered Sep 25 '22 21:09

srana
