 

Spark schema management in a single place

Tags: databricks, aws-glue

Question

What is the best way to manage Spark tables' schemas? Do you see any drawbacks of Option 2? Can you suggest any better alternatives?

Solutions I see

Option 1: keep separate definitions for code and for metastore

The drawback of this approach is that you have to continuously keep them in sync (error prone). Another drawback: it gets cumbersome if the table has 500 columns.

create_some_table.sql [1st definition]

-- Databricks syntax (internal metastore)
CREATE TABLE IF NOT EXISTS some_table (
  Id int,
  Value string,
  ...
  Year int
)
USING PARQUET
PARTITIONED BY (Year)
OPTIONS (
  PATH 'abfss://...'
)

some_job.py [2nd definition]

from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def run():
    df = spark.read.table('input_table')  # 500 columns
    df = transform(df)
    # this logic belongs in `transform`, but either way it has to exist somewhere
    df = df.select(
        'Id', 'Year', F.col('Value').cast(StringType()).alias('Value')  # effectively another schema definition: you have to enumerate all output columns
    )
    df.write.saveAsTable('some_table')

test_some_job.py [3rd definition]

def test_some_job(spark):
    output_schema = ...  # another definition
    expected = spark.createDataFrame([...], output_schema)
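For illustration, a minimal sketch of what that third definition tends to look like once filled in (the columns here are hypothetical stand-ins, not the real 500-column schema, which would all have to be re-enumerated and kept in sync by hand):

# Hypothetical duplicated schema; column names are illustrative assumptions
from pyspark.sql import types as T

def test_some_job(spark):
    output_schema = T.StructType([
        T.StructField('Id', T.IntegerType()),
        T.StructField('Value', T.StringType()),
        T.StructField('Year', T.IntegerType()),
    ])
    expected = spark.createDataFrame([(1, 'a', 2020)], output_schema)
    # ... run the job and compare the actual output against `expected` ...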

Option 2: keep only one definition in code (StructType)

It's possible to generate the schema on the fly. The benefit of this method is its simplicity and having the schema defined in a single place. Do you see any drawbacks?

from typing import List, NamedTuple

from pyspark.sql import DataFrame
from pyspark.sql.types import StructType


class Table(NamedTuple):
    name: str
    path: str
    partition_by: List[str]
    schema: StructType


def run(input: Table, output: Table):
    df = spark.read.table(input.name)
    df = transform(df)
    save(df, output)


def save(df: DataFrame, table: Table):
    df \
        .select(table.schema.fieldNames()) \
        .write \
        .partitionBy(table.partition_by) \
        .option('path', table.path) \
        .saveAsTable(table.name)
    # If the table doesn't exist, Databricks will automatically generate the table definition
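For completeness, here is a rough usage sketch of Option 2; the concrete Table instances and the calling code below are illustrative assumptions, not part of the original job:

from pyspark.sql import types as T

# Hypothetical table definition: the single place where the output schema lives
SOME_TABLE = Table(
    name='some_table',
    path='abfss://...',  # path elided, as in the question
    partition_by=['Year'],
    schema=T.StructType([
        T.StructField('Id', T.IntegerType()),
        T.StructField('Value', T.StringType()),
        T.StructField('Year', T.IntegerType()),
    ]),
)

INPUT_TABLE = Table(
    name='input_table',
    path='abfss://...',
    partition_by=[],
    schema=T.StructType([]),  # not needed when reading by table name
)

run(INPUT_TABLE, SOME_TABLE)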
asked Aug 14 '20 by VB_



1 Answer

Let me first make a few points, then a recommendation.

  1. Data lives a lot longer than code.
  2. The code described above creates and writes the data; the code that reads and consumes the data also needs to be considered.
  3. There's a 3rd option: storing the definition of the data (the schema) with the data itself, often called a 'self-describing format'.
  4. The structure of data can change over time.
  5. This question is tagged with databricks and aws-glue.
  6. Parquet is self-describing on a file-by-file basis.
  7. Delta Lake tables use Parquet data files but additionally embed the schema into the transaction log, so the entire table and its schema are versioned.
  8. Data needs to be used by a wide ecosystem of tools, so it needs to be discoverable and the schema should not be locked into a single compute engine.

Recommendation:

  1. Store the schema with the data in an open format
  2. Use Delta Lake format (which combines Parquet and a transaction log)
  3. Change USING PARQUET to USING DELTA (a minimal writer sketch follows this list)
  4. Point your metastore to the AWS Glue Catalog; the Glue Catalog will store the table name and location
  5. Consumers will resolve the schema from the Delta Lake table transaction log
  6. Schema can evolve as the writer code evolves.
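A minimal writer-side sketch of points 2-6, assuming the Option 2 `save` helper from the question and a cluster with Delta Lake available (as on Databricks). `mergeSchema` is a standard Delta Lake write option; the rest is carried over from the question:

def save(df: DataFrame, table: Table):
    (df
        .select(table.schema.fieldNames())
        .write
        .format('delta')                # instead of the Parquet format / USING PARQUET
        .option('mergeSchema', 'true')  # allow additive schema evolution on write
        .option('path', table.path)
        .partitionBy(table.partition_by)
        .mode('append')
        .saveAsTable(table.name))       # registers name + location in the metastore (e.g. Glue Catalog)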

Results:

  1. Your writer creates the schema, and may optionally evolve the schema
  2. All consumers will find the schema (paired with the table version) in the Delta Lake transaction log (the _delta_log directory, to be specific); a short read-side sketch follows below
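A read-side sketch under the same assumptions ('some_table' is the hypothetical table name used above); consumers get the schema from the Delta transaction log instead of a hand-maintained StructType:

# Consumers resolve the schema from the table itself
df = spark.read.table('some_table')
df.printSchema()  # schema served from the Delta transaction log (_delta_log)

# Tools that bypass the metastore can read by path and still resolve the schema
df = spark.read.format('delta').load('abfss://...')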
answered Sep 18 '22 by Douglas M