How does paging work in the list_blobs function in Google Cloud Storage Python Client Library

I want to get a list of all the blobs in a Google Cloud Storage bucket using the Client Library for Python.

According to the documentation, I should use the list_blobs() function. The function appears to use two arguments, max_results and page_token, to achieve paging. I am not sure how to use them.

In particular, where do I get the page_token from?

I would have expected that list_blobs() would provide a page_token for use in subsequent calls, but I cannot find any documentation on it.

In addition, max_results is optional. What happens if I don't provide it? Is there a default limit? If so, what is it?

user2771609 asked Mar 31 '17 18:03


2 Answers

list_blobs() does use paging, but you do not use page_token to achieve it.

How It Works:

list_blobs() returns an iterator that walks through all the results and does the paging behind the scenes. So simply doing this will get you through all the results, fetching pages as needed:

for blob in bucket.list_blobs():
    print(blob.name)

The Documentation is Wrong/Misleading:

As of 04/26/2017 this is what the docs say:

page_token (str) – (Optional) Opaque marker for the next “page” of blobs. If not passed, will return the first page of blobs.

This implies that the result will be a single page of results, with page_token determining which page. This is not correct. The result iterator iterates through multiple pages. What page_token actually represents is the page at which the iterator should START. If no page_token is provided, it starts at the first page.
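To make the mechanism concrete, here is a toy model of an iterator that pages behind the scenes. It is only a sketch of the idea, not the library's actual implementation: FAKE_OBJECTS, PAGE_SIZE, fetch_page and iterate_all are all hypothetical names, and the "server" is just a Python list.

```python
# Toy stand-in for the server: 7 object names, served 3 per page.
# Everything here is hypothetical and for illustration only.
FAKE_OBJECTS = [f"file-{i}.txt" for i in range(7)]
PAGE_SIZE = 3


def fetch_page(page_token=None):
    """Return (items, next_page_token) for one page.

    The token is an opaque string to the caller; it is None on the last page.
    """
    start = int(page_token) if page_token else 0
    items = FAKE_OBJECTS[start:start + PAGE_SIZE]
    next_start = start + PAGE_SIZE
    next_token = str(next_start) if next_start < len(FAKE_OBJECTS) else None
    return items, next_token


def iterate_all(page_token=None, max_results=None):
    """Yield items across pages, starting at page_token.

    Stops after max_results items in total, or when the pages run out.
    """
    yielded = 0
    while True:
        # Fetch the next page lazily, only when the caller needs more items.
        items, page_token = fetch_page(page_token)
        for item in items:
            if max_results is not None and yielded >= max_results:
                return
            yield item
            yielded += 1
        if page_token is None:
            return
```

In this model, iterate_all() visits every object, iterate_all(page_token="3") starts at the second page and continues to the end, and iterate_all(max_results=4) stops after four items even though that crosses a page boundary.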

Helpful To Know:

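max_results limits the total number of results returned by the iterator, across all pages. If you leave it out, the iterator keeps going until every blob has been returned.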

The iterator does expose pages if you need them:

for page in bucket.list_blobs().pages:
    for blob in page:
        print(blob.name)
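If you do want to drive the paging yourself (for example, to hand a token back to a caller and resume later), the pages attribute can be combined with the page_token argument. The following is a minimal sketch, assuming bucket is a google.cloud.storage Bucket and that the returned iterator exposes a next_page_token attribute, as the google-api-core page iterators do. fetch_one_page is a hypothetical helper name, not part of the library.

```python
def fetch_one_page(bucket, page_token=None):
    """Fetch a single page of blob names, starting at page_token.

    Returns (names, next_token). next_token is None when there are no
    more pages; otherwise pass it back in to continue where you left off.
    """
    iterator = bucket.list_blobs(page_token=page_token)
    # Pull exactly one page; the iterator only requests pages on demand.
    page = next(iterator.pages, None)
    names = [blob.name for blob in page] if page is not None else []
    # Assumption: the iterator records the token for the following page
    # once the current page has been fetched.
    return names, iterator.next_page_token
```

A caller would start with fetch_one_page(bucket), keep the returned token, and call fetch_one_page(bucket, page_token=token) until the token comes back as None.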
user2771609 answered Sep 19 '22 00:09


Please read the inline comments:

from google.cloud import storage

# Use a distinct name for the client so it does not shadow the imported storage module
client = storage.Client()

bucket_name = ''  # Fill in your bucket name here

# This limits the total number of results - replace it with None to get all the blobs in the bucket
max_results = 23_344

# When you pass "fields", include "nextPageToken" so that the library can keep
# paginating implicitly (the paging is managed for you by the library).
# You also need to list, inside "items(...)", all the fields you would like to fetch.
# Here are the supported fields: https://cloud.google.com/storage/docs/json_api/v1/objects#resource

fields = 'items(name),nextPageToken'

counter = 0
for blob in client.list_blobs(bucket_name, fields=fields, max_results=max_results):
    counter += 1
    print(counter, ')', blob.name)

Victor Klapholz answered Sep 20 '22 00:09