
API capture all paginated data? (python)

I'm using the requests package to hit an API (greenhouse.io). The API is paginated so I need to loop through the pages to get all the data I want. Using something like:

import requests

results = []
# loop over all 326 pages (the page count comes from the Link header below)
for i in range(1, 326 + 1):
    response = requests.get(url,
                            auth=(username, password),
                            params={'page': i, 'per_page': 100})
    if response.status_code == 200:
        results += response.json()

I know there are 326 pages by hitting the headers attribute:

In [8]:
response.headers['link']
Out[8]:
'<https://harvest.greenhouse.io/v1/applications?page=3&per_page=100>; rel="next",<https://harvest.greenhouse.io/v1/applications?page=1&per_page=100>; rel="prev",<https://harvest.greenhouse.io/v1/applications?page=326&per_page=100>; rel="last"'

Is there any way to extract this number automatically? Using the requests package? Or do I need to use regex or something?

Alternatively, should I somehow use a while loop to get all this data? What is the best way? Any thoughts?

asked Nov 17 '14 by user3439329

People also ask

How do I paginate JSON data in Python?

Paginated JSON will usually have an object with links to the previous and next JSON pages. To get the previous page, you must send a request to the "prev" URL. To get to the next page, you must send a request to the "next" URL. This will deliver a new JSON with new results and new links for the next and previous pages.
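
With the requests library, those links are exposed on response.links whenever the server sends a standard Link header, so the "next" URL can be followed in a loop. A minimal, generic sketch, assuming each page's JSON body is a list of results (the fetch_all helper and its arguments are illustrative, not part of any particular API):

import requests

def fetch_all(url, **kwargs):
    # Collect results from every page by following rel="next" links.
    results = []
    while url:
        response = requests.get(url, **kwargs)
        results.extend(response.json())
        # response.links is an empty dict when there is no Link header,
        # so url becomes None on the last page and the loop stops
        url = response.links.get('next', {}).get('url')
    return results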


1 Answer

The Python requests library (http://docs.python-requests.org/en/latest/) can help here. The basic steps are: (1) make the initial request and grab the links from the header (you'll use these to find the last page), and then (2) loop through the pages until you reach that last page.

import requests

results = []

# first request; requests parses the Link header into response.links
response = requests.get('https://harvest.greenhouse.io/v1/applications',
                        auth=('APIKEY', ''))
raw = response.json()

for i in raw:
    results.append(i)

# follow the rel="next" link until the last page, which has no "next"
while 'next' in response.links:
    response = requests.get(response.links['next']['url'],
                            auth=('APIKEY', ''))
    raw = response.json()
    for i in raw:
        results.append(i)
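
If you'd rather keep a for loop over explicit page numbers, the last page can also be read out of the parsed Link header instead of regexing the raw string. A minimal sketch, assuming the rel="last" URL carries a page query parameter as in the question's header output:

from urllib.parse import urlparse, parse_qs
import requests

response = requests.get('https://harvest.greenhouse.io/v1/applications',
                        auth=('APIKEY', ''))

# requests parses the Link header into a dict: response.links['last']['url']
last_url = response.links['last']['url']

# pull the page number out of the query string, e.g. ...?page=326&per_page=100
last_page = int(parse_qs(urlparse(last_url).query)['page'][0])

With last_page in hand, the question's range(1, last_page + 1) loop works unchanged.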
answered Oct 15 '22 by tim_schaaf