Is there a way to serialize a DataFrame schema to JSON and deserialize it later on?
The use case is simple: I have a JSON configuration file that contains the schemas for the DataFrames I need to read. I want to be able to create the default configuration from an existing schema (taken from a DataFrame), and later regenerate that schema by reading it back from the JSON string.
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the spark.read.json() function, which loads data from a directory of JSON files where each line is a separate JSON object. Note that a file offered in this form is not a typical multi-line JSON file.
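For reference, a minimal sketch of that inference step (it assumes an existing SparkSession named spark and a hypothetical JSON Lines file at data/events.json):

    // Infer the schema from a JSON Lines dataset; each line is one JSON object.
    val df = spark.read.json("data/events.json")
    df.printSchema()  // inspect the schema Spark inferred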
There are two steps for this: creating the JSON string from an existing DataFrame, and creating the schema from the previously saved JSON string.
Creating the JSON string from an existing DataFrame
    val schema = df.schema
    val jsonString = schema.json
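The resulting string is plain JSON, so it can be written straight into the configuration file. Here is a rough sketch using java.nio (the conf/schema.json path is an assumption, not part of the original answer):

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}

    // jsonString has the general shape:
    //   {"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}, ...]}
    Files.write(Paths.get("conf/schema.json"), jsonString.getBytes(StandardCharsets.UTF_8))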
Creating the schema from the JSON string
    import org.apache.spark.sql.types.{DataType, StructType}

    val newSchema = DataType.fromJson(jsonString).asInstanceOf[StructType]
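The restored schema can then be passed to the reader so Spark skips inference entirely; a small usage sketch (the input path is an assumption):

    // Read the data with the schema restored from the configuration file
    // instead of letting Spark infer it.
    val restoredDf = spark.read.schema(newSchema).json("data/events.json")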
Here is a PySpark version of the approach in Assaf's answer:
    import json
    from pyspark.sql.types import StructType

    # Save the schema from the original DataFrame into json:
    schema_json = df.schema.json()

    # Restore the schema from the json string:
    new_schema = StructType.fromJson(json.loads(schema_json))
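As with the Scala version, the restored new_schema can then be passed to the reader via spark.read.schema(new_schema) so the data is loaded with the saved schema rather than re-inferred.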