I'm using the requests package to hit an API (greenhouse.io). The API is paginated so I need to loop through the pages to get all the data I want. Using something like:
results = []
for i in range(1, 326 + 1):
    response = requests.get(url,
                            auth=(username, password),
                            params={'page': i, 'per_page': 100})
    if response.status_code == 200:
        results += response.json()
I know there are 326 pages by hitting the headers attribute:
In [8]:
response.headers['link']
Out[8]:
'<https://harvest.greenhouse.io/v1/applications?page=3&per_page=100>; rel="next",<https://harvest.greenhouse.io/v1/applications?page=1&per_page=100>; rel="prev",<https://harvest.greenhouse.io/v1/applications?page=326&per_page=100>; rel="last"'
Is there any way to extract this number automatically? Using the requests package? Or do I need to use regex or something?
Alternatively, should I somehow use a while loop to get all this data? What is the best way? Any thoughts?
Paginated JSON will usually have an object with links to the previous and next JSON pages. To get the previous page, you must send a request to the "prev" URL. To get to the next page, you must send a request to the "next" URL. This will deliver a new JSON with new results and new links for the next and previous pages.
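To see what requests gives you to work with, here is a minimal sketch that feeds the exact Link header string from your question through `requests.utils.parse_header_links` — the same parser that populates `response.links` — so no regex is needed. (The dict comprehension just rebuilds the rel-keyed mapping that `response.links` gives you for free on a real response.)

```python
import requests

# The Link header string from the question; requests exposes the parsed
# version as response.links, keyed by each link's rel value.
link_header = (
    '<https://harvest.greenhouse.io/v1/applications?page=3&per_page=100>; rel="next",'
    '<https://harvest.greenhouse.io/v1/applications?page=1&per_page=100>; rel="prev",'
    '<https://harvest.greenhouse.io/v1/applications?page=326&per_page=100>; rel="last"'
)

# Each parsed link is a dict like {'url': ..., 'rel': 'next'}
links = {link['rel']: link for link in requests.utils.parse_header_links(link_header)}
print(links['last']['url'])
```

On a live response you would simply read `response.links['last']['url']` (and `response.links['next']['url']`) directly.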
The python requests library (http://docs.python-requests.org/en/latest/) can help here. The basic steps are: (1) make the request and grab the link URLs from the header (that's where the last-page info lives), then (2) keep requesting the next page until you reach the last one.
import requests

results = []
response = requests.get('https://harvest.greenhouse.io/v1/applications',
                        auth=('APIKEY', ''))
results.extend(response.json())
# response.links maps each rel value ('next', 'prev', 'last') to a dict
# containing the parsed URL. Keep following the rel="next" link until the
# server stops sending one, which happens on the last page.
while 'next' in response.links:
    response = requests.get(response.links['next']['url'],
                            auth=('APIKEY', ''))
    results.extend(response.json())
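And to answer the other half of your question: if you would rather keep your original for-loop over page numbers, you can pull the 326 out of the rel="last" URL with the standard library's urllib.parse instead of regex. A sketch, using the URL shape from your question:

```python
from urllib.parse import urlparse, parse_qs

# URL taken from the question's rel="last" link; on a live response this
# would be response.links['last']['url']
last_url = 'https://harvest.greenhouse.io/v1/applications?page=326&per_page=100'

# parse_qs returns each query parameter as a list of strings
last_page = int(parse_qs(urlparse(last_url).query)['page'][0])
print(last_page)  # 326
```

You can then loop `for i in range(1, last_page + 1)` exactly as in your original code, though following rel="next" links is more robust if the page count changes mid-crawl.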