
PySpark, importing schema through JSON file

tbschema.json looks like this:

[{"TICKET":"integer","TRANFERRED":"string","ACCOUNT":"STRING"}]

I load it using following code

>>> df2 = sqlContext.jsonFile("tbschema.json")
>>> df2.schema
StructType(List(StructField(ACCOUNT,StringType,true),
    StructField(TICKET,StringType,true),StructField(TRANFERRED,StringType,true)))
>>> df2.printSchema()
root
 |-- ACCOUNT: string (nullable = true)
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
  1. Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

  2. The data type integer has been converted into StringType after the JSON has been read. How do I retain the data type?

asked Aug 15 '15 by sachin



1 Answer

Why do the schema elements get sorted, when I want the elements in the same order as they appear in the JSON?

Because the order of fields is not guaranteed. While it is not explicitly stated, it becomes obvious when you take a look at the examples provided in the JSON reader docstring. If you need a specific ordering you can provide the schema manually:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("TICKET", StringType(), True),
    StructField("TRANFERRED", StringType(), True),
    StructField("ACCOUNT", StringType(), True),
])
df2 = sqlContext.read.json("tbschema.json", schema)
df2.printSchema()

root
 |-- TICKET: string (nullable = true)
 |-- TRANFERRED: string (nullable = true)
 |-- ACCOUNT: string (nullable = true)

The data type integer has been converted into StringType after the JSON has been read. How do I retain the data type?

The data type of the JSON field TICKET is string, hence the JSON reader returns a string. It is a JSON reader, not some kind of schema reader.

Generally speaking, you should consider a proper format that comes with schema support out of the box, for example Parquet, Avro or Protocol Buffers. But if you really want to play with JSON you can define a poor man's "schema" parser like this:

from collections import OrderedDict
import json

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType)

with open("./tbschema.json") as fr:
    ds = fr.read()

# object_pairs_hook=OrderedDict preserves field order from the file
items = (json
  .JSONDecoder(object_pairs_hook=OrderedDict)
  .decode(ds)[0].items())

# Extend this mapping with further type names as needed
mapping = {"string": StringType, "integer": IntegerType}

schema = StructType([
    StructField(k, mapping.get(v.lower())(), True) for (k, v) in items])

The problem with JSON is that there is really no guarantee regarding field ordering whatsoever, not to mention handling of missing fields, inconsistent types and so on. So whether to use a solution like the one above really depends on how much you trust your data.

Alternatively you can use built-in schema import / export utilities.

answered Oct 15 '22 by zero323