Scraping content from infinite scroll website

I am trying to scrape the links from a webpage with infinite scrolling, but I am only able to fetch the links on the first pane. How do I proceed to build a complete list of all the links? Here is what I have so far:


from bs4 import BeautifulSoup
import requests

# Note: everything after the '#' is a URL fragment; it is never sent to the
# server, so this request only ever returns the first page of results
url = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})

y = []
for i in table:
    try:
        y.append(i.find("a", {"id": "linkToDetails"}).get('href'))
    except AttributeError:
        pass

# include the scheme so the links are actually usable
z = ['https://www.carwale.com' + item for item in y]
print(z)
Anant Gupta asked Mar 08 '26 16:03

1 Answer

You do not need BeautifulSoup to pick apart the HTML DOM at all, because the website populates the page from JSON responses; requests alone can do the job. If you watch the "Network" tab in the Chrome or Firefox developer tools, you will see that each scroll load sends a GET request to an API. Using that API, we can get clean JSON data directly.

Disclaimer: I have not checked whether this site allows web scraping. Do double-check their terms of use; I am assuming that you did.

I used pandas to handle the tabular data and to export it to CSV or whatever format you prefer: pip install pandas

import pandas as pd
from requests import Session

# Using Session and a header
req = Session() 
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36',
          'Content-Type': 'application/json;charset=UTF-8'}
# Add headers
req.headers.update(headers)

BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# Monitoring the updates on Network, the params changes in each load
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0

params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)

r = req.get(BASE_URL, params=params) #just like requests.get

# Check if everything is okay
assert r.ok, 'We did not get 200'

# get json data
data = r.json()

# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])

print(df.head())

# to go to another page create a function:

def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()

    return  pd.DataFrame(data['ResultData'])


# Just the first 5 pages :)
for i in range(5):
    params['pn'] += 1
    params['lcr'] += 24  # each page adds 24 results, per the URLs above

    dt = scrap_carwale(params)
    # append the new page; DataFrame.append is deprecated, use pd.concat
    df = pd.concat([df, dt], ignore_index=True)

#print data sample
print(df.sample(10))

# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
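If you do not know the page count up front, you can keep requesting pages until the API returns an empty ResultData. That stop condition is an assumption about how the endpoint signals the last page, so verify it against the real responses. A minimal sketch, with the fetcher injected as a parameter so the loop can be tested offline:

```python
import pandas as pd

def scrape_all(fetch, max_pages=50):
    """Collect pages until the API returns no rows.

    `fetch(params)` should return the list found in data['ResultData'].
    """
    frames = []
    params = dict(sc=-1, so=-1, car=7, pn=1, lcr=0, ldr=0, lir=0)
    for _ in range(max_pages):
        rows = fetch(params)
        if not rows:  # assumption: an empty ResultData marks the last page
            break
        frames.append(pd.DataFrame(rows))
        params['pn'] += 1
        params['lcr'] += 24  # 24 results per page, per the observed URLs
    return pd.concat(frames, ignore_index=True)

# Offline demo with a fake fetcher standing in for the real API call
pages = {1: [{'id': 1}, {'id': 2}], 2: [{'id': 3}], 3: []}
fake_fetch = lambda p: pages.get(p['pn'], [])
df = scrape_all(fake_fetch)
print(len(df))  # 3
```

Against the real site you would pass a fetcher that wraps `req.get(BASE_URL, params=params).json()['ResultData']`, and `max_pages` acts as a safety cap.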

[Screenshot: the API requests in the Network tab]

[Screenshot: the JSON response]

[Screenshot: a sample of the results]
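To circle back to the original question — the list of detail-page links — each record in ResultData should carry the relative URL of the listing. The field name used below (`url`) and the sample paths are assumptions for illustration; inspect the real JSON in the Network tab to find the actual field. You can then prefix the site root:

```python
from urllib.parse import urljoin

import pandas as pd

# Hypothetical rows mimicking data['ResultData']; the real field name for
# the detail-page path may differ -- check the JSON in the Network tab
df = pd.DataFrame({'url': ['/used/cars-in-mumbai/ford-figo-s123/',
                           '/used/cars-in-delhi/honda-city-s456/']})

# urljoin handles the leading slash, avoiding double or missing separators
links = [urljoin('https://www.carwale.com', u) for u in df['url']]
print(links)
```

This replaces the fragile `'carwale.com' + item` concatenation from the question, which produced links without a scheme.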

Prayson W. Daniel answered Mar 12 '26 16:03

