
Config file to define JSON Schema Structure in PySpark

I have created a PySpark application that reads a JSON file into a DataFrame through a defined schema. Code sample below:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])
df = sqlContext.read.json(file, schema)

I need a way to define this schema in some kind of config or ini file, and read it in the main PySpark application.

This will let me modify the schema if the JSON changes in the future, without changing the main PySpark code.

asked Jul 08 '16 by Puneet Babbar


1 Answer

You can create a JSON file named schema.json in the format below:

{
  "fields": [
    {
      "metadata": {},
      "name": "first_fields",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "double_field",
      "nullable": true,
      "type": "double"
    }
  ],
  "type": "struct"
}

Create a StructType schema by reading this file:

import json
from pyspark.sql.types import StructType

rdd = spark.sparkContext.wholeTextFiles("s3://<bucket>/schema.json")
text = rdd.collect()[0][1]
schema_dict = json.loads(text)
custom_schema = StructType.fromJson(schema_dict)

After that, you can use custom_schema as the schema to read the JSON file:

df = spark.read.json("path", schema=custom_schema)
answered Oct 25 '22 by ankursingh1000