
Migrate csv from gcs to postgresql

I'm trying to migrate CSV files from Google Cloud Storage (GCS), which have been exported from BigQuery, to a PostgreSQL Cloud SQL instance using a Python script.

I was hoping to use the Google API but found this in the documentation:

Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.

As an alternative, I could use the psycopg2 library and stream the rows of the CSV file into the SQL instance. I can do this in three ways (a rough sketch of the batch approach follows the list):

  • Line by line: read each line, submit the insert command, and commit.
  • Batch stream: read each line, submit the insert commands, and commit after every 10 or 100 lines, etc.
  • The entire CSV: read every line, submit the insert commands, and only commit at the end of the document.
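
For reference, this is a minimal sketch of the batch-stream variant using psycopg2's execute_batch; the connection details, table name, and column names are placeholders, not anything from my actual setup:

```python
import csv
import psycopg2
from psycopg2.extras import execute_batch

# Hypothetical connection details, table and column names -- adjust to your instance.
conn = psycopg2.connect(
    host="127.0.0.1",   # e.g. through the Cloud SQL Auth proxy
    dbname="mydb",
    user="postgres",
    password="secret",
)
cur = conn.cursor()

BATCH_SIZE = 1000
insert_sql = "INSERT INTO my_table (col_a, col_b, col_c) VALUES (%s, %s, %s)"

with open("export.csv", newline="") as f:
    reader = csv.reader(f)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            execute_batch(cur, insert_sql, batch)  # one round trip per page of rows
            conn.commit()                          # commit per batch (option 2)
            batch = []
    if batch:
        execute_batch(cur, insert_sql, batch)
        conn.commit()

cur.close()
conn.close()
```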

My concern is that these CSV files could contain millions of rows, and running this process with any of the three options above seems like a bad idea to me.

What alternatives do I have? Essentially I have some raw data in BigQuery on which we do some preprocessing before exporting to GCS in preparation for importing to the PostgreSQL instance. I need to export this preprocessed data from BigQuery to the PostgreSQL instance.

This is not a duplicate of this question, as I'm preferably looking for a solution that exports data from BigQuery to the PostgreSQL instance, whether via GCS or directly.

asked Oct 03 '18 by DJ319

2 Answers

You can do the import process with Cloud Dataflow, as suggested by @GrahamPolley. It's true that this solution involves some extra work (getting familiar with Dataflow, setting everything up, etc.), but even with that extra work it would be the preferred solution for your situation. However, other solutions are available, and I'll explain one of them below.

For setting up a migration process with Dataflow, this tutorial about exporting BigQuery to Google Datastore is a good example; a rough sketch of what such a pipeline could look like is shown below.
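
The tutorial targets Datastore rather than Cloud SQL, so purely to illustrate the general shape of such a pipeline, here is a minimal Beam sketch that reads the exported CSV files from GCS and writes the rows to PostgreSQL from a DoFn. The bucket path, table, columns, and connection details are placeholders, and a production pipeline would need batching, error handling, and a proper connection setup:

```python
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToPostgres(beam.DoFn):
    """Writes parsed CSV rows to Cloud SQL with psycopg2 (illustrative only)."""

    def start_bundle(self):
        import psycopg2  # imported here so Dataflow workers pick it up
        # Hypothetical connection details -- in practice you would connect
        # through the Cloud SQL proxy or a private IP.
        self.conn = psycopg2.connect(
            host="10.0.0.3", dbname="mydb", user="postgres", password="secret"
        )
        self.cur = self.conn.cursor()

    def process(self, line):
        row = next(csv.reader([line]))
        self.cur.execute(
            "INSERT INTO my_table (col_a, col_b, col_c) VALUES (%s, %s, %s)", row
        )

    def finish_bundle(self):
        self.conn.commit()
        self.cur.close()
        self.conn.close()


def run():
    options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc.
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCSV" >> beam.io.ReadFromText(
                "gs://my-bucket/export-*.csv", skip_header_lines=1
            )
            | "WriteToSQL" >> beam.ParDo(WriteToPostgres())
        )


if __name__ == "__main__":
    run()
```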


Alternative solution to Cloud Dataflow

Cloud SQL for PostgreSQL doesn't support importing from a .CSV file, but it does support .SQL files.

The file type for the specified uri.
SQL: The file contains SQL statements.
CSV: The file contains CSV data. Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.

A direct solution would be to convert the .CSV files to .SQL with some tool (Google doesn't provide one that I know of, but there are many online) and then import the result into PostgreSQL.

If you want to implement this solution in a more "programmatic" way, I would suggest using Cloud Functions. Here is an example of how I would try to do it (a rough sketch of such a function follows the list):

  1. Set up a Cloud Function that triggers when a file is uploaded to a Cloud Storage bucket
  2. Code the function to get the uploaded file and check whether it's a .CSV. If it is, use a csv-to-sql API (example of an API here) to convert the file to .SQL
  3. Store the new file in Cloud Storage
  4. Import it into the PostgreSQL instance
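
As a very rough illustration of steps 1-3, a background Cloud Function could look something like the sketch below. The bucket and table names are placeholders, and the CSV-to-SQL conversion is done inline here instead of calling an external csv-to-sql API; step 4 (importing the resulting .sql file) would still be done with `gcloud sql import sql` or the Cloud SQL Admin API:

```python
# main.py for a background Cloud Function triggered by google.storage.object.finalize.
import csv
import io

from google.cloud import storage

OUTPUT_BUCKET = "my-sql-staging-bucket"   # hypothetical
TARGET_TABLE = "my_table"                 # hypothetical


def csv_uploaded(event, context):
    bucket_name = event["bucket"]
    file_name = event["name"]
    if not file_name.lower().endswith(".csv"):
        return  # ignore non-CSV uploads

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(file_name)
    content = blob.download_as_text()

    # Convert each CSV row into an INSERT statement.
    statements = []
    for row in csv.reader(io.StringIO(content)):
        values = ", ".join("'" + v.replace("'", "''") + "'" for v in row)
        statements.append(f"INSERT INTO {TARGET_TABLE} VALUES ({values});")

    # Store the generated .SQL file back in Cloud Storage (step 3).
    sql_name = file_name.rsplit(".", 1)[0] + ".sql"
    out_blob = client.bucket(OUTPUT_BUCKET).blob(sql_name)
    out_blob.upload_from_string("\n".join(statements))
```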
answered by Guillermo Cacheda


Before you begin, make sure that:

  • The database and table you are importing into already exist on your Cloud SQL instance.
  • The CSV file meets the format requirements: one line for each row of data, with comma-separated fields.

Then you can import data into the Cloud SQL instance from a CSV file in a GCS bucket by following these steps [GCLOUD] (a small Python wrapper over the same commands is sketched after the steps):

  1. Describe the instance you are importing into:

gcloud sql instances describe [INSTANCE_NAME]

  2. Copy the serviceAccountEmailAddress field.

  3. Add the service account to the bucket ACL as a writer:

gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:W gs://[BUCKET_NAME]

  4. Add the service account to the import file as a reader:

gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:R gs://[BUCKET_NAME]/[IMPORT_FILE_NAME]

  5. Import the file:

gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/[FILE_NAME] \
    --database=[DATABASE_NAME] --table=[TABLE_NAME]

  6. If you do not need to retain the permissions provided by the ACL you set previously, remove the ACL:

gsutil acl ch -d [SERVICE_ACCOUNT_ADDRESS] gs://[BUCKET_NAME]
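
Since the question mentions driving this from a Python script, the same steps can be wrapped with subprocess calls. This is just a convenience sketch around the commands above; the instance, bucket, file, database, and table names are placeholders:

```python
import json
import subprocess

# Placeholder names -- substitute your own instance, bucket, file, database and table.
INSTANCE = "my-instance"
BUCKET = "my-bucket"
FILE_NAME = "export.csv"
DATABASE = "mydb"
TABLE = "my_table"


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Steps 1-2: describe the instance and grab its service account address.
out = subprocess.run(
    ["gcloud", "sql", "instances", "describe", INSTANCE, "--format=json"],
    check=True, capture_output=True, text=True,
)
service_account = json.loads(out.stdout)["serviceAccountEmailAddress"]

# Steps 3-4: grant the service account access to the bucket and the import file.
run(["gsutil", "acl", "ch", "-u", f"{service_account}:W", f"gs://{BUCKET}"])
run(["gsutil", "acl", "ch", "-u", f"{service_account}:R", f"gs://{BUCKET}/{FILE_NAME}"])

# Step 5: run the import.
run([
    "gcloud", "sql", "import", "csv", INSTANCE, f"gs://{BUCKET}/{FILE_NAME}",
    f"--database={DATABASE}", f"--table={TABLE}",
])

# Step 6: optionally remove the ACL afterwards.
run(["gsutil", "acl", "ch", "-d", service_account, f"gs://{BUCKET}"])
```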

answered by Tiago Martins Peres