
Create spark dataframe schema from json schema representation

Is there a way to serialize a dataframe schema to json and deserialize it later on?

The use case is simple: I have a json configuration file which contains the schema for the dataframes I need to read. I want to be able to create the default configuration from an existing dataframe's schema, and later generate the relevant schema by reading it back from the json string.

asked Dec 04 '16 by Assaf Mendelson

People also ask

How do I infer schema from JSON file?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line is a separate JSON object. Note that a file in this format is not a typical multi-line JSON file.


2 Answers

There are two steps for this: Creating the json from an existing dataframe and creating the schema from the previously saved json string.

Creating the json string from an existing dataframe

    val schema = df.schema
    val jsonString = schema.json

Creating the schema from json

    import org.apache.spark.sql.types.{DataType, StructType}
    val newSchema = DataType.fromJson(jsonString).asInstanceOf[StructType]
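For reference, the string produced by `schema.json` (and accepted by `DataType.fromJson`) follows Spark's serialized `StructType` layout: a `"struct"` type with a list of fields, each carrying a name, type, nullability flag, and metadata. A minimal sketch of that shape with an illustrative two-column schema, using only Python's standard `json` module so no Spark installation is needed (the field names here are made up, not from the question):

```python
import json

# An example of the JSON representation Spark emits for a StructType
# (field names are illustrative):
schema_json = """
{
  "type": "struct",
  "fields": [
    {"name": "id",   "type": "long",   "nullable": false, "metadata": {}},
    {"name": "name", "type": "string", "nullable": true,  "metadata": {}}
  ]
}
"""

# Parsing it back yields the dict structure that the fromJson
# methods consume on the Spark side:
parsed = json.loads(schema_json)
print(parsed["type"])  # struct
print([f["name"] for f in parsed["fields"]])
```

Because the representation is plain JSON, it can be stored in any configuration file and inspected or edited by hand before being turned back into a schema.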
answered Sep 21 '22 by Assaf Mendelson


Here is a pyspark version of the answer posted by Assaf:

    from pyspark.sql.types import StructType
    import json

    # Save schema from the original DataFrame into json:
    schema_json = df.schema.json()

    # Restore schema from json:
    new_schema = StructType.fromJson(json.loads(schema_json))
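To cover the configuration-file use case from the question, the serialized schema can be written to disk and read back before being handed to `StructType.fromJson`. A minimal sketch using only the standard library (the file name and schema contents are illustrative; the actual `StructType.fromJson` call needs pyspark, so it is shown only as a comment):

```python
import json
import os
import tempfile

# A serialized schema as df.schema.json() would produce it
# (contents illustrative):
schema_json = ('{"type": "struct", "fields": [{"name": "value", '
               '"type": "string", "nullable": true, "metadata": {}}]}')

# Write the schema into a json configuration file:
config_path = os.path.join(tempfile.mkdtemp(), "schema_config.json")
with open(config_path, "w") as f:
    json.dump({"schema": json.loads(schema_json)}, f)

# Later, read the configuration back; the resulting dict is what
# StructType.fromJson expects:
with open(config_path) as f:
    schema_dict = json.load(f)["schema"]
# new_schema = StructType.fromJson(schema_dict)  # requires pyspark

print(schema_dict["fields"][0]["name"])  # value
```

Nesting the schema under a `"schema"` key, as above, leaves room in the same configuration file for other per-dataframe settings such as paths or formats.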
answered Sep 21 '22 by mishkin