
PySpark, importing schema through JSON file

tbschema.json looks like this:

[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]

I load it using following code

>>> df2 = sqlContext.jsonFile("tbschema.json")
>>> df2.schema
StructType(List(StructField(ACCOUNT,StringType,true),
    StructField(TICKET,StringType,true),StructField(TRANFERRED,StringType,true)))
>>> df2.printSchema()
root
 |-- ACCOUNT: string (nullable = true)
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
  1. Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

  2. The data type integer has been converted into StringType after the JSON has been read. How do I retain the data type?

asked Aug 15 '15 by sachin



1 Answer

Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

Because the order of fields is not guaranteed. While it is not explicitly stated, it becomes obvious when you take a look at the examples provided in the JSON reader docstring. If you need a specific ordering you can provide the schema manually:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("TICKET", StringType(), True),
    StructField("TRANFERRED", StringType(), True),
    StructField("ACCOUNT", StringType(), True),
])
df2 = sqlContext.read.json("tbschema.json", schema)
df2.printSchema()

root
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
 |-- ACCOUNT: string (nullable = true)

The data type integer has been converted into StringType after the JSON has been read. How do I retain the data type?

The data type of the JSON field TICKET is string, hence the JSON reader returns a string. It is a JSON reader, not some kind of schema reader.

Generally speaking, you should consider a proper format that comes with schema support out of the box, for example Parquet, Avro or Protocol Buffers. But if you really want to play with JSON you can define a poor man's "schema" parser like this:

from collections import OrderedDict
import json

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType)

with open("./tbschema.json") as fr:
    ds = fr.read()

# object_pairs_hook=OrderedDict preserves field order from the file
items = (json
  .JSONDecoder(object_pairs_hook=OrderedDict)
  .decode(ds)[0].items())

# Extend this mapping with further type names as needed
mapping = {"string": StringType, "integer": IntegerType}

schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in items])

The problem with JSON is that there is really no guarantee regarding field ordering whatsoever, not to mention handling of missing fields, inconsistent types and so on. So whether to use a solution like the one above really depends on how much you trust your data.

Alternatively you can use built-in schema import / export utilities.

answered Oct 15 '22 by zero323