I'm trying to scrape a table from a dynamic page. With the following code (it requires selenium), I manage to get the contents of the <table> element.
I'd like to convert this table into a CSV, and I have tried two things, but both fail:
pandas.read_html
returns an error saying I don't have html5lib installed, but I do, and in fact I can import it without problems.
soup.find_all('tr')
returns the error 'NoneType' object is not callable
after I run soup = BeautifulSoup(tablehtml)
Here is my code:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
import pandas as pd
main_url = "http://data.stats.gov.cn/english/easyquery.htm?cn=E0101"
driver = webdriver.Firefox()
driver.get(main_url)
time.sleep(7)
driver.find_element_by_partial_link_text("Industry").click()
time.sleep(7)
driver.find_element_by_partial_link_text("Main Economic Indicat").click()
time.sleep(6)
driver.find_element_by_id("mySelect_sj").click()
time.sleep(2)
driver.find_element_by_class_name("dtText").send_keys("last72")
time.sleep(3)
driver.find_element_by_class_name("dtTextBtn").click()
time.sleep(2)
table=driver.find_element_by_id("table_main")
tablehtml= table.get_attribute('innerHTML')
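One likely reason pandas.read_html fails on this string: get_attribute('innerHTML') returns only the table's contents, without the enclosing <table> tag, and read_html only picks up <table> elements. A minimal sketch of the workaround, using a small stand-in fragment in place of the real tablehtml captured above:

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for the string captured by table.get_attribute('innerHTML');
# note there is no surrounding <table> tag.
tablehtml = "<tr><td>Header1</td><td>Header2</td></tr><tr><td>Row 11</td><td>Row 12</td></tr>"

# Re-wrap the fragment in a <table> tag before handing it to read_html.
dfs = pd.read_html(StringIO("<table>" + tablehtml + "</table>"))
dfs[0].to_csv("table.csv", index=False, header=False)
```

Wrapping the fragment in StringIO also avoids the deprecation warning newer pandas versions emit for literal HTML strings.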
Using the csv module and selenium selectors would probably be more convenient here:
import csv
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://example.com/")
table = driver.find_element_by_css_selector("#tableid")
with open('eggs.csv', 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
Without access to the table you're actually trying to scrape, I used this example:
<table>
<thead>
<tr>
<td>Header1</td>
<td>Header2</td>
<td>Header3</td>
</tr>
</thead>
<tr>
<td>Row 11</td>
<td>Row 12</td>
<td>Row 13</td>
</tr>
<tr>
<td>Row 21</td>
<td>Row 22</td>
<td>Row 23</td>
</tr>
<tr>
<td>Row 31</td>
<td>Row 32</td>
<td>Row 33</td>
</tr>
</table>
and scraped it using:
from bs4 import BeautifulSoup as BS

content = ...  # the HTML of that table
soup = BS(content, 'html5lib')
rows = [tr.findAll('td') for tr in soup.findAll('tr')]
This rows object is a list of lists:
[
[<td>Header1</td>, <td>Header2</td>, <td>Header3</td>],
[<td>Row 11</td>, <td>Row 12</td>, <td>Row 13</td>],
[<td>Row 21</td>, <td>Row 22</td>, <td>Row 23</td>],
[<td>Row 31</td>, <td>Row 32</td>, <td>Row 33</td>]
]
...and you can write it to a file:
with open('result.csv', 'w') as f:
    for it in rows:
        f.write(", ".join(str(e).replace('<td>', '').replace('</td>', '') for e in it) + '\n')
which looks like this:
Header1, Header2, Header3
Row 11, Row 12, Row 13
Row 21, Row 22, Row 23
Row 31, Row 32, Row 33
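As a side note, the manual replace() calls can be avoided: BeautifulSoup's get_text() extracts each cell's text, and the csv module takes care of quoting. A sketch of that variant, using a small inline table and the stdlib html.parser so no extra parser install is needed:

```python
import csv
from bs4 import BeautifulSoup

# Hypothetical stand-in for the table HTML scraped above.
html = ("<table><tr><td>Header1</td><td>Header2</td></tr>"
        "<tr><td>Row 11</td><td>Row 12</td></tr></table>")

soup = BeautifulSoup(html, "html.parser")
with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for tr in soup.find_all("tr"):
        # get_text(strip=True) drops the tags and surrounding whitespace.
        writer.writerow(td.get_text(strip=True) for td in tr.find_all("td"))
```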