
How to concatenate a million files in Google Cloud Storage

According to the documentation for Google Cloud Storage, there are a few limitations on using gsutil compose (see below).

Is there a more efficient way to combine a large number of files in the same bucket (~1 million)?

If I understand correctly, I would have to compose the files in groups of 32, then compose the results in groups of 32 again, and repeat until a single object remains?

Note that there is a limit (currently 32) to the number of components that can be composed in a single operation.

There is a limit (currently 1024) to the total number of components for a given composite object. This means you can append to each object at most 1023 times.

There is a per-project rate limit (currently 200) to the number of components you can compose per second. This rate counts both the components being appended to a composite object as well as the components being copied when the composite object of which they are a part is copied.

d-_-b asked Mar 26 '17 03:03


2 Answers

GCS no longer enforces a component count limit. You can combine 1 million files as long as the newly created object is <= 5 TiB. You still have to join the files in groups of at most 32 by composing repeatedly, as described in the GCS documentation on composing objects.

A simple serial approach is to keep appending to a single object, overwriting it on each compose call. For example:

  1. Upload the files as objects {F_i, F_(i+1), ... F_total}
  2. Compose objects {F_1, F_2, ... F_31} to create composite object X
  3. Repeatedly compose objects {X, F_(i+1), F_(i+2), ... F_(i+30)} to overwrite X with a new object, until every file has been appended
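The serial steps above can be sketched roughly as follows. This is only an illustration: `serial_compose` and the injected `compose(sources, destination)` callable are hypothetical names, not part of any GCS library, and the sketch fills all 32 source slots per call, so its batch sizes are slightly larger than those in the list. With the google-cloud-storage client, `compose` would presumably wrap `Blob.compose`, but that wiring is an assumption and is not shown.

```python
from typing import Callable, List, Sequence

MAX_SOURCES = 32  # per-request compose limit quoted in the question


def serial_compose(
    names: Sequence[str],
    target: str,
    compose: Callable[[List[str], str], None],
) -> int:
    """Append every object in `names` to `target`, 32 sources per call at most.

    `compose(sources, destination)` is a hypothetical callable that performs
    one compose request. Returns the number of compose calls issued.
    """
    if not names:
        raise ValueError("nothing to compose")

    # First call: up to 32 original objects create the target.
    compose(list(names[:MAX_SOURCES]), target)
    calls = 1
    pos = MAX_SOURCES

    # Later calls: the target occupies one slot, leaving 31 for new objects.
    while pos < len(names):
        batch = list(names[pos:pos + MAX_SOURCES - 1])
        compose([target] + batch, target)
        calls += 1
        pos += len(batch)
    return calls
```

Because every call after the first adds at most 31 new objects, one million files need on the order of 32,000 sequential requests, which is why the parallel variant is attractive.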

Because the per-project compose rate limit has also been lifted, you can do this in parallel instead: compose batches into temporary objects, then delete the temporary objects once they have been merged.

  1. Compose objects {F_i, F_(i+1), ... F_(i+31)} to create composite object X^(1)_(i/32)
  2. Recursively compose objects {X^(m)_j, X^(m)_(j+1), ... X^(m)_(j+31)} to create composite object X^(m+1)_(j/32)
  3. Delete all temporary objects X^(m)_j
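A minimal sketch of the parallel, level-by-level scheme above might look like this. The function name `tree_compose`, the injected `compose` and `delete` callables, and the `tmp/<level>/<index>` naming for temporary objects are all assumptions made for illustration. Compose calls within one level are independent, so they are run on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

MAX_SOURCES = 32  # per-request compose limit quoted in the question


def tree_compose(
    names: Sequence[str],
    final_name: str,
    compose: Callable[[List[str], str], None],
    delete: Callable[[str], None],
    workers: int = 8,
) -> int:
    """Compose `names` into `final_name` in levels of 32-way merges.

    `compose(sources, destination)` and `delete(name)` are hypothetical
    callables wrapping the storage API. Returns the number of levels used.
    """
    if not names:
        raise ValueError("nothing to compose")

    current, temps, level = list(names), [], 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            level += 1
            groups = [current[i:i + MAX_SOURCES]
                      for i in range(0, len(current), MAX_SOURCES)]
            last = len(groups) == 1
            # Hypothetical temp naming: tmp/<level>/<index>.
            dests = ([final_name] if last else
                     [f"tmp/{level}/{j}" for j in range(len(groups))])
            # All compose calls of one level run concurrently.
            list(pool.map(compose, groups, dests))
            if last:
                break
            temps.extend(dests)
            current = dests
        # Clean up every intermediate object once the final one exists.
        list(pool.map(delete, temps))
    return level
```

For one million inputs this takes four levels (31,250, then 977, then 31, then 1 object), since each level shrinks the object count by a factor of 32.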

The only caveat is that the componentCount metadata property saturates at 2,147,483,647, even if the object has more than 2,147,483,647 components. If you don't depend on componentCount being accurate, this should not be a problem, since componentCount does not affect whether compose succeeds.

Mike Scarlett answered Nov 06 '22 11:11


Unfortunately, combining groups of 32 over and over again won't work, due to the "grand total" components limit of 1024.

Instead, what you'd have to do is this:

  1. Let's name the set of 1 million original files A (~1,000,000 objects).
  2. Call compose on each group of 32 objects in A, producing set B (~30,000 objects). Each object in B has a component count of 32.
  3. Call compose on each group of 32 objects in B, producing set C (~1000 objects). These new objects will have 32*32 components each, or 1024. That's exactly the limit. You cannot compose them directly any further.
  4. Call "rewrite" on each element of C. This will reset the component count back to 1.
  5. Call compose on each group of 32 elements in C, producing set D (~30 objects).
  6. Call compose once to combine all of D.

Much of this work can be done in parallel, which would greatly speed things up.
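The set sizes in the steps above can be checked with a short calculation. This is a sketch under the assumptions stated in this answer (32 sources per compose call, a 1024-component cap, and a rewrite resetting the component count to 1); the function name `plan` and its tuple layout are hypothetical.

```python
def plan(n_objects: int, per_call: int = 32, max_components: int = 1024):
    """List the (operation, object_count, max_component_count) steps.

    Assumes per_call >= 2. A "rewrite" step is inserted whenever the next
    compose would push an object past `max_components`.
    """
    steps = []
    count, components = n_objects, 1
    while count > 1:
        group = min(per_call, count)  # largest group in this round
        if components * group > max_components:
            # Rewriting resets each object's component count to 1.
            steps.append(("rewrite", count, 1))
            components = 1
        count = (count + per_call - 1) // per_call  # ceiling division
        components *= group
        steps.append(("compose", count, components))
    return steps


for step in plan(1_000_000):
    print(step)
```

For 1,000,000 objects the exact set sizes come out as 31,250 (B), 977 (C) and 31 (D), with a single rewrite pass between C and D, which matches the rounded figures in the list.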

Brandon Yarbrough answered Nov 06 '22 11:11