I have a Dataflow job that writes to BigQuery. It works well for a non-nested schema, but fails for a nested schema.
Here is my Dataflow pipeline:
pipeline_options = PipelineOptions()
p = beam.Pipeline(options=pipeline_options)
wordcount_options = pipeline_options.view_as(WordcountTemplatedOptions)
schema = 'url: STRING,' \
         'ua: STRING,' \
         'method: STRING,' \
         'man: RECORD,' \
         'man.ip: RECORD,' \
         'man.ip.cc: STRING,' \
         'man.ip.city: STRING,' \
         'man.ip.as: INTEGER,' \
         'man.ip.country: STRING,' \
         'man.res: RECORD,' \
         'man.res.ip_dom: STRING'
first = p | 'read' >> ReadFromText(wordcount_options.input)
second = (first
          | 'process' >> (beam.ParDo(processFunction()))
          | 'write' >> beam.io.WriteToBigQuery(
              'myBucket:tableFolder.test_table',
              schema=schema)
          )
I created the BigQuery table using the following schema:
[
  {
    "mode": "NULLABLE",
    "name": "url",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "ua",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "method",
    "type": "STRING"
  },
  {
    "mode": "REPEATED",
    "name": "man",
    "type": "RECORD",
    "fields": [
      {
        "mode": "REPEATED",
        "name": "ip",
        "type": "RECORD",
        "fields": [
          {
            "mode": "NULLABLE",
            "name": "cc",
            "type": "STRING"
          },
          {
            "mode": "NULLABLE",
            "name": "city",
            "type": "STRING"
          },
          {
            "mode": "NULLABLE",
            "name": "as",
            "type": "INTEGER"
          },
          {
            "mode": "NULLABLE",
            "name": "country",
            "type": "STRING"
          }
        ]
      },
      {
        "mode": "REPEATED",
        "name": "res",
        "type": "RECORD",
        "fields": [
          {
            "mode": "NULLABLE",
            "name": "ip_dom",
            "type": "STRING"
          }
        ]
      }
    ]
  }
]
I am getting the following error:
BigQuery creation of import job for table "test_table" in dataset "tableFolder" in project "myBucket" failed., BigQuery execution failed., HTTP transport error:
Message: Invalid value for: url is not a valid value
HTTP Code: 400
Question: Can someone please guide me? What am I doing wrong? Also, if there is a better way to handle the nested schema and write to BigQuery, please suggest it.
Additional info: My data file:
{"url":"xyz.com","ua":"Mozilla/5.0 Chrome/63","method":"PUT","man":{"ip":{"cc":"IN","city":"delhi","as":274,"country":"States"},"res":{"ip_dom":"v1"}}}
{"url":"xyz.com","ua":"Mozilla/5.0 Chrome/63","method":"PUT","man":{"ip":{"cc":"DK","city":"munlan","as":4865,"country":"United"},"res":{"ip_dom":"v1"}}}
{"url":"xyz.com","ua":"Mozilla/5.0 Chrome/63","method":"GET","man":{"ip":{"cc":"BS","city":"sind","as":7655,"country":"India"},"res":{"ip_dom":"v1"}}}
The problem with your code is that you try to use nested fields while specifying the BigQuery table schema as a string, which is not supported.
In order to push nested records into BigQuery from Apache Beam, you need to create a TableSchema object, e.g. using the built-in parser:
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json
table_schema = parse_table_schema_from_json(your_bigquery_json_schema)
You need to pass the schema as a JSON string there; you can obtain it with the following command in your terminal (I assume you have the gcloud tools installed):
bq --project=your-gcp-project-name --format=json show your.table.name > schema.json
and in Python use it as follows:
import json

table_schema = parse_table_schema_from_json(json.dumps(json.load(open("schema.json"))["schema"]))
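Alternatively, if you prefer not to shell out to bq, you can write the JSON schema inline in your pipeline code. A sketch based on the schema you posted (field names and modes are copied from your question; adjust them if your table differs):

import json
from apache_beam.io.gcp.bigquery import parse_table_schema_from_json

# Inline version of the posted schema; parse_table_schema_from_json expects a
# JSON string with a top-level "fields" list.
inline_schema = {
    "fields": [
        {"name": "url", "type": "STRING", "mode": "NULLABLE"},
        {"name": "ua", "type": "STRING", "mode": "NULLABLE"},
        {"name": "method", "type": "STRING", "mode": "NULLABLE"},
        {"name": "man", "type": "RECORD", "mode": "REPEATED", "fields": [
            {"name": "ip", "type": "RECORD", "mode": "REPEATED", "fields": [
                {"name": "cc", "type": "STRING", "mode": "NULLABLE"},
                {"name": "city", "type": "STRING", "mode": "NULLABLE"},
                {"name": "as", "type": "INTEGER", "mode": "NULLABLE"},
                {"name": "country", "type": "STRING", "mode": "NULLABLE"},
            ]},
            {"name": "res", "type": "RECORD", "mode": "REPEATED", "fields": [
                {"name": "ip_dom", "type": "STRING", "mode": "NULLABLE"},
            ]},
        ]},
    ]
}

table_schema = parse_table_schema_from_json(json.dumps(inline_schema))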
Then in your pipeline:
beam.io.WriteToBigQuery(
    'myBucket:tableFolder.test_table',
    schema=table_schema)
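Note that WriteToBigQuery expects each element of the input PCollection to be a Python dict whose keys match the schema, so your processFunction has to emit rows in that shape. A sketch of one such row for the schema you posted (values taken from your sample data; because man, ip and res are declared REPEATED in your posted schema they have to be lists of dicts, so if each record really holds a single object, as in your data file, NULLABLE RECORD may be a better fit):

row = {
    'url': 'xyz.com',
    'ua': 'Mozilla/5.0 Chrome/63',
    'method': 'PUT',
    # REPEATED RECORD fields are lists of dicts; each nested RECORD is a dict.
    'man': [{
        'ip': [{'cc': 'IN', 'city': 'delhi', 'as': 274, 'country': 'States'}],
        'res': [{'ip_dom': 'v1'}],
    }],
}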
You can also take a look at the example showing manual creation of a TableSchema object:
https://github.com/apache/beam/blob/474345f5987e47a22d063c7bfcb3638c85a57e64/sdks/python/apache_beam/examples/cookbook/bigquery_schema.py
which is (from the linked example):
from apache_beam.io.gcp.internal.clients import bigquery
table_schema = bigquery.TableSchema()
full_name_schema = bigquery.TableFieldSchema()
full_name_schema.name = 'fullName'
full_name_schema.type = 'string'
full_name_schema.mode = 'required'
table_schema.fields.append(full_name_schema)
# A nested field
phone_number_schema = bigquery.TableFieldSchema()
phone_number_schema.name = 'phoneNumber'
phone_number_schema.type = 'record'
phone_number_schema.mode = 'nullable'
number = bigquery.TableFieldSchema()
number.name = 'number'
number.type = 'integer'
number.mode = 'nullable'
phone_number_schema.fields.append(number)
area_code = bigquery.TableFieldSchema()
area_code.name = 'areaCode'
area_code.type = 'integer'
area_code.mode = 'nullable'
phone_number_schema.fields.append(area_code)
table_schema.fields.append(phone_number_schema)
Then just use the table_schema variable in beam.io.WriteToBigQuery.
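For completeness, the manual approach applied to your table could look roughly like this (a sketch; names and modes are copied from the schema in your question, and the field helper is just a hypothetical shortcut to cut down on repetition):

from apache_beam.io.gcp.internal.clients import bigquery

def field(name, field_type, mode='NULLABLE'):
    # Hypothetical helper: build a single TableFieldSchema.
    f = bigquery.TableFieldSchema()
    f.name = name
    f.type = field_type
    f.mode = mode
    return f

table_schema = bigquery.TableSchema()
table_schema.fields.append(field('url', 'STRING'))
table_schema.fields.append(field('ua', 'STRING'))
table_schema.fields.append(field('method', 'STRING'))

# Nested records: man contains ip and res, matching the posted schema.
ip = field('ip', 'RECORD', 'REPEATED')
for name, field_type in [('cc', 'STRING'), ('city', 'STRING'),
                         ('as', 'INTEGER'), ('country', 'STRING')]:
    ip.fields.append(field(name, field_type))

res = field('res', 'RECORD', 'REPEATED')
res.fields.append(field('ip_dom', 'STRING'))

man = field('man', 'RECORD', 'REPEATED')
man.fields.append(ip)
man.fields.append(res)
table_schema.fields.append(man)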