How to extract tables from websites in Python

Tags:

python

urllib

There is a table on this page:

http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500

My goal is to extract the table and save it to a CSV file. Here is the code I wrote:

import urllib
import os

web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")

s = web.read()
web.close()

ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()

I'm lost from here. Can anyone help with this? Thanks!

asked May 11 '12 17:05

Bill TP

4 Answers

Pandas can do this right out of the box, saving you from having to parse the HTML yourself. read_html() extracts all tables from your HTML and puts them in a list of dataframes. to_csv() can then be used to write each dataframe to a CSV file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.

import requests
import pandas as pd

url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')

It's simple enough to do in one line, if you prefer:

pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)

P.S. read_html() needs an HTML parser installed in advance: lxml (the default), or BeautifulSoup4 together with html5lib as a fallback.
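
Picking the table by position (df_list[-1]) depends on the page layout. A minimal sketch of selecting it by its id attribute instead, through the attrs parameter of read_html(). The inline HTML and its values are made up for illustration; the id is the one used in the lxml answer on this page. Wrapping the markup in StringIO avoids the deprecation warning that recent pandas versions give for literal HTML strings.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded page: two tables, only one has the id we want.
SAMPLE_HTML = """
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table id="Report1_dgReportDemographic">
  <tr><th>Tract Code</th><th>Population</th></tr>
  <tr><td>101</td><td>1500</td></tr>
  <tr><td>102</td><td>2300</td></tr>
</table>
"""

# attrs narrows the result to tables whose attributes match the given dict.
tables = pd.read_html(
    io.StringIO(SAMPLE_HTML),
    attrs={"id": "Report1_dgReportDemographic"},
)
df = tables[0]

# index=False leaves the dataframe's row numbers out of the CSV.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With the real page, the StringIO argument would wrap requests.get(url).text instead of the sample string.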

answered Oct 15 '22 12:10

MarredCheese

So essentially you want to parse an HTML file and get elements out of it. You can use BeautifulSoup or lxml for this task.

You already have solutions using BeautifulSoup. I'll post a solution using lxml:

from lxml import etree
import urllib.request

web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()

html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get the text from all remaining 'tr' rows
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
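
To finish the job, header and td_content still have to be written to a file. A sketch using the standard csv module; the sample values and the output file name table.csv are placeholders, and empty cells (where td.text is None) are written as empty strings.

```python
import csv

# Placeholder data shaped like `header` and `td_content` above.
header = ["Tract Code", "Population"]
td_content = [["101", " 1500 "], ["102", None]]


def write_table(path, header, rows):
    """Write a header row plus data rows to a CSV file."""
    # newline="" lets the csv module control line endings (Python 3).
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            # td.text is None for empty cells, so guard before stripping.
            writer.writerow(["" if cell is None else cell.strip() for cell in row])


write_table("table.csv", header, td_content)
```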

answered Oct 15 '22 11:10

Vikas

I would recommend BeautifulSoup as it has the most functionality. I modified a table parser that I found online; it can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the pastebin link:

http://pastebin.com/RPNbtX8Q

You could use it as follows:

# Python 2 example (urllib2); TableParser is the module from the pastebin link
import csv
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser

url_addr = 'http://foo/bar'
req = Request(url_addr)
url = urlopen(req)
tp = TableParser()
tp.feed(url.read())

# NOTE: Here you need to know exactly how many tables are on the page and which one
# you want. Let's say it's the first table
my_table = tp.get_tables()[0]
filename = 'table_as_csv.csv'
# In Python 2 the csv module expects the file to be opened in binary mode
with open(filename, 'wb') as f:
    writer = csv.writer(f)
    for row in my_table:
        writer.writerow(row)

The code above is an outline, but if you use the table parser from the pastebin link you should be able to get to where you want to go.
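
In case the pastebin link is unavailable, here is a rough standard-library sketch of a parser with the same feed() / get_tables() shape used above. It is not the code from the link: the class name SimpleTableParser is invented here, it is written for Python 3, and it likewise assumes there are no nested tables.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self._tables = []
        self._row = None   # cells of the <tr> being read, or None
        self._cell = None  # text pieces of the <td>/<th> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables.append([])
        elif tag == "tr" and self._tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self._tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def get_tables(self):
        return self._tables


tp = SimpleTableParser()
tp.feed(
    "<table><tr><th>Name</th><th>Count</th></tr>"
    "<tr><td>Alpha</td><td> 1 </td></tr></table>"
)
tables = tp.get_tables()
print(tables[0])
```

Each table comes back as a list of lists, so it can be handed straight to csv.writer as in the outline.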

answered Oct 15 '22 13:10

aquil.abdullah

You need to parse the table into an internal data structure and then output it in CSV form.

Use BeautifulSoup to parse the table. This question is about how to do that (the accepted answer uses version 3.0.8 which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).

Once you have the table in a data structure (probably a list of lists in this case) you can write it out with csv.writer.
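
A sketch of those two steps with BeautifulSoup 4 and the standard csv module. The inline HTML and the id "demo" are invented sample data, and Python's built-in html.parser backend is used so that no extra parser package is needed.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<table id="demo">
  <tr><th>Tract Code</th><th>Population</th></tr>
  <tr><td>101</td><td>1500</td></tr>
  <tr><td>102</td><td>2300</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="demo")

# Step 1: table -> list of lists (one inner list per <tr>).
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Step 2: list of lists -> CSV text.
# For a real file, use open(path, "w", newline="") instead of StringIO.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```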

answered Oct 15 '22 12:10

Andrew Gorcester
