I'm trying to parse all the company names from this webpage. There are around 2431 companies listed there. However, the approach I've tried below only fetches 1000 results.
This is what I can see about the number of results in the response while going through dev tools:
hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431 <------------------------
nbPages: 1
page: 0
How can I get the rest of the results using requests?
This is what I've tried so far:
import requests

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}

payload = {"requests": [{"indexName": "YCCompany_production", "params": "hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.post(url, params=params, json=payload)
    print(len(r.json()['results'][0]['hits']))
Algolia's query API caps paginated results at 1,000 hits by default, which is why your request tops out there. As a workaround you can simulate a search using each letter of the alphabet as the query. With the code below you will get all 2431 companies as a dictionary, keyed by company ID, with the full company data dictionary as the value.
import requests
import string

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
                         'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
                         'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'

result = dict()
for letter in string.ascii_lowercase:
    print(letter)
    # Run one search per letter; each query returns at most 1000 hits.
    payload = {
        "requests": [{
            "indexName": "YCCompany_production",
            "params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
        }]
    }
    r = requests.post(url, params=params, json=payload)
    # Keying by company ID deduplicates hits that match more than one letter.
    result.update({h['id']: h for h in r.json()['results'][0]['hits']})

print(len(result))
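Since you only need the company names, here is a minimal sketch of how you could pull them out of the collected hits afterwards. It continues from the result dictionary built above and assumes each hit contains a 'name' field; if the index uses a different field name, adjust the key accordingly.

# Minimal sketch (assumption: each hit dictionary exposes a 'name' field).
company_names = sorted(h['name'] for h in result.values() if 'name' in h)
print(len(company_names))
print(company_names[:10])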