 

How to distribute fetching a list of keys from S3


I am trying to distribute the process of getting a list of 60 million keys (file names) from s3.

Background: I am trying to process all of the files in a folder, about 60 million of them, via pyspark. As detailed HERE, the typical sc.textFile('s3a://bucket/*') loads everything onto the driver first and then distributes the work to the cluster. The suggested method is to first acquire a list of the files, parallelize that list, and then have each node fetch a subset of the files.
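For reference, here is a minimal sketch of that suggested pattern; the bucket name, partition count, and placeholder key list are my own illustrative values rather than anything from the question:

    # Sketch of the "parallelize the key list, fetch per partition" pattern.
    # "bucket", the partition count, and the empty key list are placeholders.
    import boto3
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def fetch_partition(keys):
        # Create one boto3 client per partition, on the executor.
        s3 = boto3.client("s3")
        for key in keys:
            obj = s3.get_object(Bucket="bucket", Key=key)
            yield key, obj["Body"].read()

    keys = []  # the 60M keys would go here -- producing this list is the bottleneck below
    contents = sc.parallelize(keys, 2048).mapPartitions(fetch_partition)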

Problem: In this method there is still a bottleneck in the "acquire a list of files" step if that list is large enough. This step of getting a list of keys (file names) in an s3 bucket must also be distributed to be efficient.

What I've Tried: I have tried two different methods:

  1. Using the Python AWS API (boto3), which pages the results. Ideally we could estimate the number of pages and distribute ranges, so that node 1 requests pages 1-100, node 2 requests pages 101-200, and so on. Unfortunately you cannot request an arbitrary page: you have to pass the "next token" returned by the preceding page, so the result pages are effectively a linked list (see the sketch just after this list).

  2. The AWS CLI, which allows exclude and include filters. Since the file names I am retrieving all start with an 8-digit integer, I could in theory have node 1 request the full file list for names matching 10*, node 2 request the list for names matching 11*, and so on. This is done by:

    aws s3 cp s3://bucket/ . --recursive --exclude="*" --include="10*"

Unfortunately this seems to do a full scan on every request rather than using any index, since each request hangs for more than 15 minutes.
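For concreteness, the boto3 loop behind option 1 looks roughly like this (bucket and prefix are placeholders); because each request needs the NextContinuationToken returned by the previous response, the pages cannot be handed out to different nodes independently:

    # Sequential key listing with boto3; each page depends on the previous one.
    import boto3

    s3 = boto3.client("s3")
    keys = []
    kwargs = {"Bucket": "bucket", "Prefix": ""}  # placeholders
    while True:
        resp = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        # The token for page N+1 only exists after page N has been fetched.
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]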

Is there a way to make either solution viable? Is there a third option? I'm sure I am not alone in having millions of s3 files which need to be digested.

asked Dec 30 '16 by Chris.Caldwell


1 Answer

If you need a list of the contents of an Amazon S3 bucket and it does not need to be perfectly up to date, you could use Amazon S3 Inventory, which can deliver a daily CSV listing of all objects in the bucket. You could then use that list to drive your pyspark jobs.
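As a rough sketch of how such an inventory report could feed pyspark (the destination path and the four-column layout are assumptions; inventory CSVs have no header row and their columns depend on the fields chosen when the inventory was configured):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the gzipped CSV data files delivered by S3 Inventory; the path and
    # the column layout (bucket, key, size, last_modified) are assumed here.
    inventory = (
        spark.read
        .option("header", "false")
        .csv("s3a://inventory-bucket/source-bucket/daily/data/")
        .toDF("bucket", "key", "size", "last_modified")
    )

    # Feed the keys straight into a distributed fetch rather than collecting
    # them on the driver.
    key_rdd = inventory.select("key").rdd.map(lambda row: row.key)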

Along similar lines, you could maintain a database of all files and keep it updated whenever objects are added to or removed from the bucket, using Amazon S3 Event Notifications. That way the list is always up to date and readily available to your pyspark jobs.
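A hypothetical sketch of that approach, with a Lambda function keeping a DynamoDB table in sync with the bucket (the table name, key schema, and event wiring are assumptions, not part of the answer):

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("s3-file-index")  # hypothetical table keyed on "key"

    def handler(event, context):
        # Invoked by S3 Event Notifications for ObjectCreated/ObjectRemoved events.
        for record in event["Records"]:
            key = record["s3"]["object"]["key"]
            if record["eventName"].startswith("ObjectCreated"):
                table.put_item(Item={"key": key})
            elif record["eventName"].startswith("ObjectRemoved"):
                table.delete_item(Key={"key": key})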

answered by John Rotenstein