Is there a BigQuery feature that can migrate a whole dataset to another project without copying the data?

As our project grew, we realized that we needed to create new projects and reorganize our datasets. In one case, we need to isolate a dataset from the others by moving it into a new project. I know that I can do this by copying the tables one by one through the API and then deleting the old ones. But with over a thousand tables, this consumes a lot of time, because each copy is executed as a job and takes time to complete. Is it possible to just change the reference (or path) of a dataset?

Follow-up: I tried copying the tables using a batch request. I got 200 OK for every request, but the tables didn't get copied. I wonder why, and how to get the real result. Here's my code:

    public async Task CopyTableToProjectInBatchAsync(IList<TableList.TablesData> fromTables, string toProjectId)
    {
        var request = new BatchRequest(BigQueryService);
        foreach (var tableData in fromTables)
        {
            string fromDataset = tableData.TableReference.DatasetId;
            string fromTableId = tableData.TableReference.TableId;
            Logger.Info("copying table {0}...",tableData.Id);
            request.Queue<JobReference>(CreateTableCopyRequest(fromDataset, fromTableId, toProjectId),
            (content, error, i, message) =>
            {
                Logger.Info("#content:\n" + content);
                Logger.Info("#error:\n" + error);
                Logger.Info("#i:\n" + i);
                Logger.Info("#message:\n" + message);
            });
        }
        await request.ExecuteAsync();
    }

    private IClientServiceRequest CreateTableCopyRequest(string fromDatasetId, string fromTableId, string toProjectId,
        string toDatasetId = null, string toTableId = null)
    {
        if (toDatasetId == null)
            toDatasetId = fromDatasetId;
        if (toTableId == null)
            toTableId = fromTableId;
        TableReference sourceTableReference = new TableReference
        {
            ProjectId = _account.ProjectId,
            DatasetId = fromDatasetId,
            TableId = fromTableId
        };
        TableReference targetTableReference = new TableReference
        {
            ProjectId = toProjectId,
            DatasetId = toDatasetId,
            TableId = toTableId
        };
        JobConfigurationTableCopy copyConfig = new JobConfigurationTableCopy
        {
            CreateDisposition = "WRITE_TRUNCATE",
            DestinationTable = targetTableReference,
            SourceTable = sourceTableReference
        };
        JobReference jobRef = new JobReference {JobId = GenerateJobID("copyTable"), ProjectId = _account.ProjectId};
        JobConfiguration jobConfig = new JobConfiguration {Copy = copyConfig};
        Job job = new Job {Configuration = jobConfig, JobReference = jobRef};

        return BigQueryService.Jobs.Insert(job, _account.ProjectId);
    }
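
On the follow-up: a 200 OK from a batched `jobs.insert` only means that BigQuery accepted the request and created a job. The copy itself runs asynchronously, and a failure is reported in the job's `status.errorResult`, not in the HTTP status. Two details in the code above may be worth checking. `jobs.insert` returns a `Job` resource, so queuing the request as `Queue<Job>` rather than `Queue<JobReference>` gives the callback access to the job's status. Also, `WRITE_TRUNCATE` is a `writeDisposition` value; `createDisposition` accepts `CREATE_IF_NEEDED` or `CREATE_NEVER`. Below is a minimal sketch of polling a job until it finishes, assuming the same `Google.Apis.Bigquery.v2` client as in the question. The class name `CopyJobStatus`, the method name `WaitForJobAsync`, and the `pollInterval` parameter are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;
using Google.Apis.Bigquery.v2;
using Google.Apis.Bigquery.v2.Data;

// Hypothetical helper, not part of the client library.
public static class CopyJobStatus
{
    // Polls jobs.get until the job reaches the DONE state, then returns it.
    // Throws if the finished job carries an errorResult.
    public static async Task<Job> WaitForJobAsync(
        BigqueryService service, string projectId, string jobId, TimeSpan pollInterval)
    {
        while (true)
        {
            Job job = await service.Jobs.Get(projectId, jobId).ExecuteAsync();
            JobStatus status = job.Status;

            if (status != null && status.State == "DONE")
            {
                if (status.ErrorResult != null)
                {
                    // The job finished, but the copy failed.
                    throw new InvalidOperationException(
                        "Job " + jobId + " failed: " + status.ErrorResult.Message);
                }
                return job; // DONE without an errorResult: the copy succeeded
            }

            // Still PENDING or RUNNING: wait before asking again.
            await Task.Delay(pollInterval);
        }
    }
}
```

With the request queued as `Queue<Job>`, the batch callback could pass `content.JobReference.JobId` to `WaitForJobAsync` together with `_account.ProjectId`, the project in which the question's code creates the jobs.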
foxwendy asked Sep 22 '15 18:09


2 Answers

You can first copy the BigQuery dataset to the new project, then delete the original dataset.

The copy-dataset UI is similar to the one for copying a table. Click the "Copy dataset" button on the source dataset and specify the destination dataset in the pop-up form, as in the screenshots below. The public documentation covers more use cases.

[Screenshot: Copy dataset button]

[Screenshot: Copy dataset form]
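
The same dataset copy can also be set up without the UI, since dataset copies run through the BigQuery Data Transfer Service with the `cross_region_copy` data source. Here is a rough sketch, assuming the `Google.Cloud.BigQuery.DataTransfer.V1` NuGet package, application default credentials, and the Data Transfer API enabled in the destination project. The class and method names, the display name, and all project and dataset IDs are placeholders, and the exact client calls should be checked against the library's documentation.

```csharp
using Google.Cloud.BigQuery.DataTransfer.V1;
using Google.Protobuf.WellKnownTypes;

// Hypothetical helper, not part of the client library.
public static class DatasetCopy
{
    // Creates a dataset-copy transfer in the destination project.
    // The destination dataset is assumed to exist already.
    public static TransferConfig CreateDatasetCopy(
        string sourceProjectId, string sourceDatasetId,
        string destinationProjectId, string destinationDatasetId)
    {
        DataTransferServiceClient client = DataTransferServiceClient.Create();

        var config = new TransferConfig
        {
            DisplayName = "Dataset copy (example name)",
            DataSourceId = "cross_region_copy", // data source used for dataset copies
            DestinationDatasetId = destinationDatasetId,
            Params = new Struct
            {
                Fields =
                {
                    { "source_project_id", Value.ForString(sourceProjectId) },
                    { "source_dataset_id", Value.ForString(sourceDatasetId) }
                }
            }
        };

        // The transfer config is created under the destination project.
        return client.CreateTransferConfig("projects/" + destinationProjectId, config);
    }
}
```

Once the transfer has finished and the destination has been verified, the original dataset can be deleted as described above.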

Jian He answered Nov 21 '22 14:11


There's no built-in feature, but I helped write a tool, which we've open-sourced, that will do this for you: https://github.com/uswitch/big-replicate.

It will let you synchronise/copy tables between projects or between datasets within the same project. Most of the details are in the project's README, but for reference, an invocation looks a little like:

    java -cp big-replicate-standalone.jar \
      uswitch.big_replicate.sync \
      --source-project source-project-id \
      --source-dataset 98909919 \
      --destination-project destination-project-id \
      --destination-dataset 98909919

You can set options that control how many tables to copy, how many jobs run concurrently, and where to store the intermediate data in Cloud Storage. The destination dataset must already exist, but this means you can also copy data between locations (US, EU, Asia, etc.).

Binaries are built on CircleCI and published to GitHub releases.

pingles answered Nov 21 '22 16:11