I'm using the requests package to hit an API (greenhouse.io). The API is paginated so I need to loop through the pages to get all the data I want. Using something like:
results = []
for i in range(1, 326 + 1):
    response = requests.get(url,
                            auth=(username, password),
                            params={'page': i, 'per_page': 100})
    if response.status_code == 200:
        results += response.json()
I know there are 326 pages by hitting the headers attribute:
In [8]:
response.headers['link']
Out[8]:
'<https://harvest.greenhouse.io/v1/applications?page=3&per_page=100>; rel="next",<https://harvest.greenhouse.io/v1/applications?page=1&per_page=100>; rel="prev",<https://harvest.greenhouse.io/v1/applications?page=326&per_page=100>; rel="last"'
Is there any way to extract this number automatically? Using the requests package? Or do I need to use regex or something?
Alternatively, should I somehow use a while loop to get all this data? What is the best way? Any thoughts?
Paginated JSON will usually have an object with links to the previous and next JSON pages. To get the previous page, you must send a request to the "prev" URL. To get to the next page, you must send a request to the "next" URL. This will deliver a new JSON with new results and new links for the next and previous pages.
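To see what requests gives you to work with, here is a minimal sketch that feeds the exact Link header string from your question through `requests.utils.parse_header_links` — the same parser that populates `response.links` — so no regex is needed. (The dict comprehension just rebuilds the rel-keyed mapping that `response.links` gives you for free on a real response.)

```python
import requests

# The Link header string from the question; requests exposes the parsed
# version as response.links, keyed by each link's rel value.
link_header = (
    '<https://harvest.greenhouse.io/v1/applications?page=3&per_page=100>; rel="next",'
    '<https://harvest.greenhouse.io/v1/applications?page=1&per_page=100>; rel="prev",'
    '<https://harvest.greenhouse.io/v1/applications?page=326&per_page=100>; rel="last"'
)

# Each parsed link is a dict like {'url': ..., 'rel': 'next'}
links = {link['rel']: link for link in requests.utils.parse_header_links(link_header)}
print(links['last']['url'])
```

On a live response you would simply read `response.links['last']['url']` (and `response.links['next']['url']`) directly.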
The python requests library (http://docs.python-requests.org/en/latest/) can help here. The basic steps are: (1) make the request and grab the link URLs from the header (that's where the last-page info lives), then (2) keep requesting the next page until you reach the last one.
import requests

results = []
response = requests.get('https://harvest.greenhouse.io/v1/applications',
                        auth=('APIKEY', ''))
results.extend(response.json())
# response.links maps each rel value ('next', 'prev', 'last') to a dict
# containing the parsed URL. Keep following the rel="next" link until the
# server stops sending one, which happens on the last page.
while 'next' in response.links:
    response = requests.get(response.links['next']['url'],
                            auth=('APIKEY', ''))
    results.extend(response.json())
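And to answer the other half of your question: if you would rather keep your original for-loop over page numbers, you can pull the 326 out of the rel="last" URL with the standard library's urllib.parse instead of regex. A sketch, using the URL shape from your question:

```python
from urllib.parse import urlparse, parse_qs

# URL taken from the question's rel="last" link; on a live response this
# would be response.links['last']['url']
last_url = 'https://harvest.greenhouse.io/v1/applications?page=326&per_page=100'

# parse_qs returns each query parameter as a list of strings
last_page = int(parse_qs(urlparse(last_url).query)['page'][0])
print(last_page)  # 326
```

You can then loop `for i in range(1, last_page + 1)` exactly as in your original code, though following rel="next" links is more robust if the page count changes mid-crawl.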