How to import a CSV file into a BigQuery table without any column names or schema?

I'm currently writing a Java utility to import a few CSV files from GCS into BigQuery. I could easily achieve this with bq load, but I wanted to do it using a Dataflow job. So I'm using Dataflow's Pipeline and a ParDo transform (which returns a TableRow to apply to the BigQueryIO sink), and I have created a StringToRowConverter() for the transformation. Here is where the actual problem starts: I am forced to specify the schema for the destination table, although I don't want to create a new table if it doesn't exist; I'm only trying to load data. So I do not want to manually set the column names for the TableRow, as I have about 600 columns.

public class StringToRowConverter extends DoFn<String, TableRow> {

    private static final Logger logger = LoggerFactory.getLogger(StringToRowConverter.class);

    @Override
    public void processElement(ProcessContext c) {
        TableRow row = new TableRow();
        // The column name is unknown here, which is exactly the problem.
        row.set("DO NOT KNOW THE COLUMN NAME", c.element());
        c.output(row);
    }
}

Moreover, it is assumed that the table already exists in the BigQuery dataset and I don't need to create it, and that the CSV file contains the columns in the correct order.

If there's no workaround for this scenario and the column names are needed for the data load, then I can include them in the first row of the CSV file.

Any help will be appreciated.

asked Aug 18 '17 by Vijin Paulraj

People also ask

How do I change the table schema in BigQuery?

In the console, go to the BigQuery page. In the Explorer panel, expand your project and dataset, then select the table. In the details panel, click the Schema tab. Click Edit schema.
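
If you prefer to script the change, here is a minimal sketch using the google-cloud-bigquery Java client (my choice of client; the dataset, table, and column names are placeholders, and only additive changes such as appending a NULLABLE column are allowed):

import com.google.cloud.bigquery.*;

import java.util.ArrayList;
import java.util.List;

public class AddColumn {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder identifiers: adjust to your dataset/table.
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));
        Schema schema = table.getDefinition().getSchema();

        // Copy the existing fields and append a new NULLABLE column.
        List<Field> fields = new ArrayList<>(schema.getFields());
        fields.add(Field.newBuilder("new_column", LegacySQLTypeName.STRING)
                .setMode(Field.Mode.NULLABLE)
                .build());

        // Update the table definition with the widened schema.
        table.toBuilder()
                .setDefinition(StandardTableDefinition.of(Schema.of(fields)))
                .build()
                .update();
    }
}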

How to export BigQuery table to CSV?

Follow the simple steps below to effortlessly Export BigQuery Table to CSV: Step 1: Go to the Google Cloud Console in BigQuery. Step 2: Navigate to the Explorer panel and select the desired table from your project. Step 3: From the details panel, click on the Export option and select Export to Cloud Storage.
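
For a programmatic alternative, here is a minimal sketch that runs an extract job with the google-cloud-bigquery Java client (an assumption on my part; the table and bucket names are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class ExportToCsv {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder identifiers: adjust to your dataset/table/bucket.
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));

        // Start an extract job that writes the table as CSV to Cloud Storage,
        // then block until it finishes.
        Job job = table.extract("CSV", "gs://my-bucket/my_table.csv");
        job.waitFor();
    }
}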

How do I load CSV data from cloud storage into BigQuery?

To load CSV data from Cloud Storage into a new BigQuery table: Open the BigQuery page in the Cloud Console. In the navigation panel, in the Resources section, expand your project and select a dataset. On the right side of the window, in the details panel, click Create table.
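
The same load can be scripted. Below is a minimal sketch using the google-cloud-bigquery Java client (an assumption; names and URIs are placeholders) with schema autodetection, which sidesteps spelling out a 600-column schema by hand:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadCsvFromGcs {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder identifiers: adjust to your dataset/table/bucket.
        TableId tableId = TableId.of("my_dataset", "my_table");
        LoadJobConfiguration config =
                LoadJobConfiguration.newBuilder(tableId, "gs://my-bucket/myfile.csv")
                        .setFormatOptions(FormatOptions.csv())
                        .setAutodetect(true) // let BigQuery infer the schema
                        .build();

        // Run the load job and wait for completion.
        Job job = bigquery.create(JobInfo.of(config));
        job.waitFor();
    }
}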

Can I load data into BigQuery from a local data source?

Yes. For information about loading CSV data from a local file, see Loading data into BigQuery from a local data source.
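
For a programmatic route, here is a minimal sketch that streams a local CSV into an existing table via the google-cloud-bigquery Java client (my choice of client; the path and identifiers are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.TableDataWriteChannel;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.WriteChannelConfiguration;

import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadLocalCsv {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder identifiers: adjust to your dataset/table.
        WriteChannelConfiguration config =
                WriteChannelConfiguration.newBuilder(TableId.of("my_dataset", "my_table"))
                        .setFormatOptions(FormatOptions.csv())
                        .build();

        // Stream the local file through a write channel into BigQuery.
        TableDataWriteChannel writer = bigquery.writer(config);
        try (OutputStream stream = Channels.newOutputStream(writer)) {
            Files.copy(Paths.get("/path/to/myfile.csv"), stream);
        }

        // The upload job is available once the channel is closed.
        writer.getJob().waitFor();
    }
}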

How do I import data from a CSV file in Excel?

After selecting the file, the tool provides two options: import the data into an existing table, or create a new one. If you choose an existing table, select the table into which you want to import the data. To import the CSV into a new table, provide a name for the new table.


1 Answer

To avoid creating the table, use BigQueryIO.Write.CreateDisposition.CREATE_NEVER on the BigQueryIO.Write transform when configuring the pipeline. Source: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/io/BigQueryIO.Write

You don't need to know the BigQuery table schema upfront; you can discover it dynamically. For instance, you can use the BigQuery API (https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/get) to query the table schema and pass it as a parameter to the StringToRowConverter class, as sketched below. Another option, assuming the first row is a header, is to consume that first row and use it to map the columns for the rest of the file.
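
For the first option, here is a minimal sketch that fetches the column names with the google-cloud-bigquery Java client (an assumption on my part; the answer links the raw REST endpoint, and the dataset/table IDs are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.TableId;

import java.util.ArrayList;
import java.util.List;

public class SchemaFetcher {
    // Returns the column names of an existing table, in schema order.
    static List<String> columnNames(String dataset, String table) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Schema schema = bigquery.getTable(TableId.of(dataset, table))
                .getDefinition()
                .getSchema();
        List<String> names = new ArrayList<>();
        for (Field field : schema.getFields()) {
            names.add(field.getName());
        }
        return names;
    }
}

These names could then be handed to StringToRowConverter through its constructor instead of reading them from the file.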

The code below implements the second approach and also configures the output to append to an existing BigQuery table. Note that it relies on the header row being seen before any data rows, which holds only when the file is read in order by a single worker.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.PipelineResult;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

import java.util.Arrays;

public class DFJob {

    public static class StringToRowConverter extends DoFn<String, TableRow> {

        private String[] columnNames;

        private boolean isFirstRow = true;

        @Override
        public void processElement(ProcessContext c) {
            TableRow row = new TableRow();

            // Naive split; use a proper CSV parser if fields can contain commas.
            String[] parts = c.element().split(",");

            if (isFirstRow) {
                // Treat the first line seen by this DoFn instance as the header.
                // Caveat: this only works when the file is read in order by a
                // single worker; with parallel reads, fetch the schema upfront.
                columnNames = Arrays.copyOf(parts, parts.length);
                isFirstRow = false;
            } else {
                for (int i = 0; i < parts.length; i++) {
                    row.set(columnNames[i], parts[i]);
                }
                c.output(row);
            }
        }
    }

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.create()
                .as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);

        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.Read.from("gs://dataflow-samples/myfile.csv"))
                .apply(ParDo.of(new StringToRowConverter()))
                // "myTable" stands in for a full project:dataset.table reference.
                .apply(BigQueryIO.Write.to("myTable")
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        PipelineResult result = p.run();
    }
}
answered Oct 14 '22 by fgasparini