
Python->Beautifulsoup->Webscraping->Looping over URL (1 to 53) and saving Results

Here is the Website I am trying to scrape http://livingwage.mit.edu/

The specific URLs are from

http://livingwage.mit.edu/states/01

http://livingwage.mit.edu/states/02

http://livingwage.mit.edu/states/04 (For some reason they skipped 03)

...all the way to...

http://livingwage.mit.edu/states/56

And on each one of these URLs, I need the last row of the second table:

Example for http://livingwage.mit.edu/states/01

Required annual income before taxes $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Desired output:

Alabama $20,260 $42,786 $51,642 $64,767 $34,325 $42,305 $47,345 $53,206 $34,325 $47,691 $56,934 $66,997

Alaska $24,070 $49,295 $60,933 $79,871 $38,561 $47,136 $52,233 $61,531 $38,561 $54,433 $66,316 $82,403

...

...

Wyoming $20,867 $42,689 $52,007 $65,892 $34,988 $41,887 $46,983 $53,549 $34,988 $47,826 $57,391 $68,424

After 2 hours of messing around, this is what I have so far (I am a beginner):

import requests, bs4

res = requests.get('http://livingwage.mit.edu/states/01')
res.raise_for_status()
states = bs4.BeautifulSoup(res.text)

state_name = states.select('h1')

table = states.find_all('table')[1]
rows = table.find_all('tr', 'odd')[4:]

result = []
result.append(state_name)
result.append(rows)

When I viewed state_name and rows in the Python console, they gave me the html elements

[<h1>Living Wag...Alabama</h1>]

and

[<tr class = "odd...   </td> </tr>]

Problem 1: These are the things that I want in the desired output, but how can I get Python to give them to me as strings rather than HTML like above?

Problem 2: How do I loop through requests.get for url 01 to url 56?

Thank you for your help.

And if you can offer a more efficient way of getting to the rows variable in my code, I would greatly appreciate it, because the way I get there is not very Pythonic.

Omi Slash asked Aug 11 '16 12:08




2 Answers

Just get all the states from the initial page; then you can select the second table and use the css classes odd results to get the tr you need. There is no need to slice, as the class names are unique:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # python2 -> from urlparse import urljoin


base = "http://livingwage.mit.edu"
res = requests.get(base)
res.raise_for_status()

states = []
# Get all state urls and state names from the anchor tags on the base page:
# all the anchors inside each li that are children of the
# ul with the css classes "states list-unstyled".
for a in BeautifulSoup(res.text, "html.parser").select("ul.states.list-unstyled li a"):
    # The hrefs look like "/states/51/locations".
    # We want everything before /locations, so we split on / from the right -> /states/51
    # and join to the base url. The anchor text also holds the state name,
    # so we store the full url and the state, i.e. ("http://livingwage.mit.edu/states/01", "Alabama").
    states.append((urljoin(base, a["href"].rsplit("/", 1)[0]), a.text))


def parse(soup):
    # Get the second table; indexing in css starts at 1, so "table:nth-of-type(2)" gets the second table.
    table = soup.select_one("table:nth-of-type(2)")
    # To get the text, we just need to find all the tds and call .text on each.
    # The row we want has the css classes "odd results"; "td + td" starts from the second td,
    # as we don't want the first ("Required annual income before taxes").
    return [td.text.strip() for td in table.select_one("tr.odd.results").select("td + td")]


# Unpack the url and state from each tuple in our states list.
for url, state in states:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print(state, parse(soup))

If you run the code you will see output like:

Alabama ['$21,144', '$43,213', '$53,468', '$67,788', '$34,783', '$41,847', '$46,876', '$52,531', '$34,783', '$48,108', '$58,748', '$70,014']
Alaska ['$24,070', '$49,295', '$60,933', '$79,871', '$38,561', '$47,136', '$52,233', '$61,531', '$38,561', '$54,433', '$66,316', '$82,403']
Arizona ['$21,587', '$47,153', '$59,462', '$78,112', '$36,332', '$44,913', '$50,200', '$58,615', '$36,332', '$52,483', '$65,047', '$80,739']
Arkansas ['$19,765', '$41,000', '$50,887', '$65,091', '$33,351', '$40,337', '$45,445', '$51,377', '$33,351', '$45,976', '$56,257', '$67,354']
California ['$26,249', '$55,810', '$64,262', '$81,451', '$42,433', '$52,529', '$57,986', '$68,826', '$42,433', '$61,328', '$70,088', '$84,192']
Colorado ['$23,573', '$51,936', '$61,989', '$79,343', '$38,805', '$47,627', '$52,932', '$62,313', '$38,805', '$57,283', '$67,593', '$81,978']
Connecticut ['$25,215', '$54,932', '$64,882', '$80,020', '$39,636', '$48,787', '$53,857', '$61,074', '$39,636', '$60,074', '$70,267', '$82,606']

You could loop over a range from 1 to 53, but extracting the anchors from the base page also gives us the state name in a single step. Using the h1 from each state page would instead give you output like Living Wage Calculation for Alabama, which you would then have to parse down to just the name; that is not trivial, considering some states have more than one word in their names.
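If you did go the h1 route, the extra parsing step might look like this sketch, which assumes the heading always begins with the fixed prefix shown on the site (the heading string here is just an example):

```python
heading = 'Living Wage Calculation for New Hampshire'  # example h1 text
prefix = 'Living Wage Calculation for '
# Slice off the fixed lead-in, leaving only the state name; this still
# handles multi-word names like "New Hampshire" correctly.
state = heading[len(prefix):] if heading.startswith(prefix) else heading
print(state)  # New Hampshire
```

This only works as long as the site keeps that exact heading format, which is why reading the names from the anchors is the more robust approach.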

Padraic Cunningham answered Nov 03 '22 07:11


Problem 1: These are the things that I want in the desired output, but how can I get Python to give them to me as strings rather than HTML like above?

You can get the text simply by doing something along the lines of:

state_name=states.find('h1').text

The same can be applied for each of the rows too.
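For the rows, a minimal sketch of the same idea, run here against a small stand-in snippet shaped like one "odd results" row from the page (the markup below is illustrative, not copied from the site):

```python
import bs4

# A tiny stand-in for one table row, shaped like the "odd results" row.
html = '''
<table><tr class="odd results">
  <td>Required annual income before taxes</td>
  <td>$20,260</td><td>$42,786</td>
</tr></table>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')
row = soup.find('tr', class_='odd')
# .text gives a tag's string content; strip() trims surrounding whitespace.
values = [td.text.strip() for td in row.find_all('td')]
print(values)  # ['Required annual income before taxes', '$20,260', '$42,786']
```

Applied to the question's rows variable, the same list comprehension turns each tr into a list of plain strings.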

Problem 2: How do I loop through requests.get for url 01 to url 56?

The same code block can be put inside a loop from 1 to 56 like so:

for i in range(1,57):
    res = requests.get('http://livingwage.mit.edu/states/'+str(i).zfill(2))
    ...rest of the code...

zfill will add the leading zeroes. Also, it would be better if requests.get were enclosed in a try-except block, so that the loop continues gracefully even when a url does not exist (03, for example, is skipped).
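Putting the loop, zfill, and error handling together, a sketch along these lines (the row selection follows the question's code; the table layout is assumed from the question):

```python
import requests, bs4

def state_url(i):
    # zfill(2) pads single digits: 1 -> "01", 56 stays "56".
    return 'http://livingwage.mit.edu/states/' + str(i).zfill(2)

def scrape_state(url):
    # Wrapped in try-except so one missing id (e.g. /03) doesn't stop the loop.
    try:
        res = requests.get(url)
        res.raise_for_status()
    except requests.exceptions.RequestException:
        return None
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    name = soup.find('h1').text
    table = soup.find_all('table')[1]  # second table, as in the question
    row = table.find('tr', class_='results')
    return name, [td.text.strip() for td in row.find_all('td')]

# To run the full scrape:
# for i in range(1, 57):
#     result = scrape_state(state_url(i))
#     if result:
#         print(result)
```

Returning None on a failed request and checking for it in the loop keeps the gaps in the numbering from raising.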

ham answered Nov 03 '22 06:11