Hi, I have about 10 tables which I have selected using lxml.
>>>import pandas as pd
>>>import lxml.etree
>>>root = lxml.etree.HTML(htmlcontent)
>>>tables = root.findall('.//*[@id="info-container"]/table')
>>>readabletables = tables[::2]
>>>len(readabletables)
5
>>>readabletables[0]
<Element table at 0x105241e60>
I want pandas to read these 5 tables and turn them into DataFrames, just as pd.read_html does for raw HTML.
How would I go about doing this?
I can now answer my own question, and maybe this will be of assistance to others.
I tried modifying the read_html source code in pandas, without much success because of some recognition issues. Nonetheless, the answer is much simpler than you might think.
>>>import pandas as pd
>>>import lxml.etree
>>>root = lxml.etree.HTML(htmlcontent)
>>>tables = root.findall('.//*[@id="info-container"]/table')
>>>readabletables = tables[::2]
>>>len(readabletables)
5
^ This is what we have already established.
Now, for pandas's read_html to recognise an lxml table, the table needs to be converted back into HTML. To do this:
>>>lxml.etree.tostring(readabletables[0],method='html')
'<table... table>'
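To convert all the tables into a list of pandas DataFrames (read_html returns a list of DataFrames, so take the first element of each result; on Python 3 and recent pandas versions, serialise to text and wrap it in StringIO):
>>>from io import StringIO
>>>pd_tables = [pd.read_html(StringIO(lxml.etree.tostring(table,method='html',encoding='unicode')))[0] for table in readabletables]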
>>>len(pd_tables)
5
>>>type(pd_tables[0])
<class 'pandas.core.frame.DataFrame'>
Mission accomplished.
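For anyone who wants to try the whole round trip end to end, here is a self-contained sketch. The sample markup and the helper name `elements_to_frames` are my own for illustration and are not part of pandas or lxml; it assumes each element is a single `<table>`, so `read_html` returns exactly one DataFrame per element.

```python
from io import StringIO

import lxml.etree
import pandas as pd

# Hypothetical sample markup standing in for htmlcontent.
sample_html = """
<div id="info-container">
  <table><tr><th>name</th><th>qty</th></tr><tr><td>apple</td><td>3</td></tr></table>
  <table><tr><th>city</th><th>pop</th></tr><tr><td>Oslo</td><td>700</td></tr></table>
</div>
"""


def elements_to_frames(elements):
    """Serialise each lxml <table> element and parse it with pandas."""
    frames = []
    for element in elements:
        # Serialise the element back to HTML text (str, not bytes).
        markup = lxml.etree.tostring(element, method="html", encoding="unicode")
        # read_html returns a list of DataFrames; one table in, one frame out.
        frames.append(pd.read_html(StringIO(markup))[0])
    return frames


root = lxml.etree.HTML(sample_html)
tables = root.findall('.//*[@id="info-container"]/table')
frames = elements_to_frames(tables)

for frame in frames:
    print(frame)
```

Wrapping the markup in `StringIO` avoids handing `read_html` a literal HTML string, which newer pandas releases warn about.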