I have a JSON object with an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with Spark SQL.
Here is a sample object:
{
  stuff: [
    {a: 1, b: 2, c: 3}
  ]
}
So, in JavaScript, to get the value for c, I'd write myData.stuff[0].c.
And in my Spark SQL query, if that array weren't there, I'd be able to use dot notation:
SELECT stuff.c FROM blah
But I can't, because the innermost object is wrapped in an array.
I've tried:
SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL
So, what is the magical way to select that data? Or is that even supported yet?
If you are using Spark SQL, you can also use the size() function, which returns the number of elements in an array or map column.
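For instance, a minimal sketch against the df temp table registered in the answer below (the single-element stuff array from the question would give a size of 1):

# size() counts the elements of the stuff array per row
sqlContext.sql("SELECT size(stuff) FROM df").show()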
To show the full column content, use the show() function. show() displays the DataFrame: n is the number of rows to display, and truncate tells the output sink whether to cut long values short. Set truncate to False to display the full column content; by default it is True.
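For example, a quick sketch assuming a DataFrame named df:

# Show 5 rows without truncating long cell values
df.show(n=5, truncate=False)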
SQLContext is the entry point to Spark SQL, a Spark module for structured data processing. Once a SQLContext is initialised, you can use it to perform various SQL-like operations over Datasets and DataFrames.
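A minimal sketch of that initialisation (in the pyspark shell, sc and sqlContext are usually predefined for you, so this is only needed in a standalone script):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="example")  # "example" is a placeholder app name
sqlContext = SQLContext(sc)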
It is not clear what you mean by a JSON object, so let's consider two different cases:
An array of structs
import tempfile
path = tempfile.mktemp()
with open(path, "w") as fw:
    fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
df = sqlContext.read.json(path)
df.registerTempTable("df")
df.printSchema()
## root
## |-- stuff: array (nullable = true)
## | |-- element: struct (containsNull = true)
## | | |-- a: long (nullable = true)
## | | |-- b: long (nullable = true)
## | | |-- c: long (nullable = true)
sqlContext.sql("SELECT stuff[0].a FROM df").show()
## +---+
## |_c0|
## +---+
## | 1|
## +---+
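For the value of c that the question asks about, the same bracket-then-dot syntax applies:

sqlContext.sql("SELECT stuff[0].c FROM df").show()
## +---+
## |_c0|
## +---+
## |  3|
## +---+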
An array of maps
# Note: schema inference from dictionaries has been deprecated
# don't use this in practice
df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
df.registerTempTable("df")
df.printSchema()
## root
## |-- stuff: array (nullable = true)
## | |-- element: map (containsNull = true)
## | | |-- key: string
## | | |-- value: long (valueContainsNull = true)
sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
## +---+
## |_c0|
## +---+
## | 1|
## +---+
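If you need values from every element of the array rather than just the first, exploding the array is one option; a sketch against the struct-typed df from the first case (this goes beyond the original question):

from pyspark.sql.functions import explode

# Flatten each element of stuff into its own row, then use plain dot notation
df.select(explode(df.stuff).alias("s")).select("s.c").show()
## +---+
## |  c|
## +---+
## |  3|
## +---+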
See also Querying Spark SQL DataFrame with complex types