
Scraping second page in Python Yields Different Data Than Browsing to Second Page

I'm trying to scrape some data from www.ksl.com/auto/ using Python Requests and Beautiful Soup. I can get the results from the first search page but not from subsequent pages. When I request the second page using the same URL that Chrome constructs when I click the "Next" button, I get a set of results that no longer matches my search query. I've found other questions on Stack Overflow that discuss Ajax calls that load subsequent pages, and using Chrome's Developer Tools to examine the request, but none of that has helped me with this problem, which I've also had on other sites.

My example query (the first URL in the code below) returns only Acuras on the site. When you advance to the second page in the browser, the URL is simply https://www.ksl.com/auto/search/index?page=1. When I use Requests to hit those two URLs, the second set of results is not limited to Acuras. Is there perhaps a cookie that my browser passes back to the server to preserve my filters?

I would appreciate any advice someone can give about how to get subsequent pages of the results I searched for.

Here is the simple code I'm using:

from requests import get
from bs4 import BeautifulSoup

page1 = get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = get('https://www.ksl.com/auto/search/index?page=1').text

soup = BeautifulSoup(page1, 'html.parser')
listings = soup.find_all("div", class_="srp-listing-body-right")
listings[0]  # An Acura - success!

soup2 = BeautifulSoup(page2, 'html.parser')
listings2 = soup2.find_all("div", class_="srp-listing-body-right")
listings2[0]  # Not an Acura. :(
Matt Frei asked Mar 07 '26 21:03


1 Answer

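Try creating a Session object and making both requests through it. A Session keeps the cookies the server sets on the first request and sends them back on later ones, so your session with the server is maintained when you request the next page and the server should be able to apply the filters from your original search.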

import requests
from bs4 import BeautifulSoup

s = requests.Session() # Add this line

page1 = s.get('https://www.ksl.com/auto/search/index?keyword=&make%5B%5D=Acura&yearFrom=&yearTo=&mileageFrom=&mileageTo=&priceFrom=&priceTo=&zip=&miles=25&newUsed%5B%5D=All&page=0&sellerType=&postedTime=&titleType=&body=&transmission=&cylinders=&liters=&fuel=&drive=&numberDoors=&exteriorCondition=&interiorCondition=&cx_navSource=hp_search&search.x=63&search.y=8&search=Search+raquo%3B').text
page2 = s.get('https://www.ksl.com/auto/search/index?page=1').text
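
If you need more than two pages, the same idea extends to a loop. This is only a sketch, assuming the site keeps your search filters in server-side state tied to the session cookie and numbers its pages from 0 as in the URLs above. The names fetch_result_pages, SEARCH_URL and the page URL template are made up for illustration and are not part of Requests or of ksl.com.

```python
def fetch_result_pages(session, search_url, page_url_template, last_page):
    """Return the HTML of pages 0..last_page, all fetched through one session.

    `session` is expected to behave like requests.Session: get(url) returns a
    response with a .text attribute and a raise_for_status() method.
    The first request carries the full search query. Assumption: the server
    remembers those filters against the cookie it sets in that response.
    """
    first = session.get(search_url)
    first.raise_for_status()  # stop early on an HTTP error status
    pages = [first.text]

    # Later pages are requested with only the page number, as the browser does.
    for page_number in range(1, last_page + 1):
        response = session.get(page_url_template.format(page=page_number))
        response.raise_for_status()
        pages.append(response.text)
    return pages


# Usage with Requests (not executed here; SEARCH_URL stands for the long
# Acura query URL from the question):
#
#   import requests
#   with requests.Session() as s:
#       pages = fetch_result_pages(
#           s, SEARCH_URL, 'https://www.ksl.com/auto/search/index?page={page}', 2)
```

Each string in the returned list can then be passed to BeautifulSoup exactly as in the question. If later pages still ignore the filters, the site probably tracks them some other way, and comparing the cookies and headers your browser sends in Chrome's Developer Tools would be the next thing to check.
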
JRodDynamite answered Mar 09 '26 09:03