 

SparkSQL sql syntax for nth item in array

I have a JSON object with an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with Spark SQL.

Here is a sample object:

{
  stuff: [
    {a:1,b:2,c:3}
  ]
}

So, in JavaScript, to get the value for c, I'd write myData.stuff[0].c

And in my Spark SQL query, if that array weren't there, I'd be able to use dot notation:

SELECT stuff.c FROM blah

but I can't, because the innermost object is wrapped in an array.

I've tried:

SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL

So, what is the magical way to select that data? Or is that even supported yet?

Kristian asked Jan 21 '16

People also ask

How do I get the length of an array in Spark SQL?

If you are using Spark SQL, you can also use the size() function, which returns the size of an array or map type column.
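
For example, a minimal sketch, assuming a DataFrame df with an array column named stuff as in the question above (the alias is only for a readable output column name):

    from pyspark.sql.functions import size

    # size() counts the elements of an array (or entries of a map) column
    df.select(size(df["stuff"]).alias("n")).show()
    ## +---+
    ## |  n|
    ## +---+
    ## |  1|
    ## +---+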

How do you display the contents of a DataFrame?

Use the show() function to display a DataFrame. Its n parameter sets the number of rows to display, and its truncate parameter controls whether long values are cut off: it defaults to True, so set truncate to False to display the full column content.
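
A quick sketch, assuming a DataFrame df:

    # Display the first 5 rows with full (untruncated) column content
    df.show(n=5, truncate=False)

    # Defaults: 20 rows, long values truncated
    df.show()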

What is SQLContext?

SQLContext is the entry point to Spark SQL, a Spark module for structured data processing. Once a SQLContext is initialised, you can use it to perform various SQL-like operations over Datasets and DataFrames.
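
A minimal sketch of the initialisation in PySpark (the application name is an arbitrary placeholder):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="example")  # placeholder app name
    sqlContext = SQLContext(sc)

    # sqlContext can now load data and run SQL-like operations
    sqlContext.sql("SELECT 1").show()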


1 Answer

It is not clear what you mean by "JSON object", so let's consider two different cases:

  1. An array of structs

    import tempfile

    # Write the sample JSON to a temporary file so it can be read back
    path = tempfile.mktemp()
    with open(path, "w") as fw:
        fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')

    # Read it as a DataFrame and register it as a temporary table for SQL
    df = sqlContext.read.json(path)
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: struct (containsNull = true)
    ##  |    |    |-- a: long (nullable = true)
    ##  |    |    |-- b: long (nullable = true)
    ##  |    |    |-- c: long (nullable = true)
    
    sqlContext.sql("SELECT stuff[0].a FROM df").show()
    
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
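
    # If you prefer the DataFrame API over raw SQL, the same lookup can be
    # written with column indexing; a sketch against the df defined above
    # (the alias is only for a readable output column name):
    from pyspark.sql.functions import col

    df.select(col("stuff")[0]["a"].alias("a")).show()

    ## +---+
    ## |  a|
    ## +---+
    ## |  1|
    ## +---+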
    
  2. An array of maps

    # Note: schema inference from dictionaries has been deprecated
    # don't use this in practice
    df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: map (containsNull = true)
    ##  |    |    |-- key: string
    ##  |    |    |-- value: long (valueContainsNull = true)
    
    sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
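
    # Since inferring the schema from dictionaries is deprecated, a sketch
    # of declaring the map-typed column explicitly with createDataFrame
    # (the schema below is an assumption matching the sample data):
    from pyspark.sql.types import (ArrayType, LongType, MapType,
                                   StringType, StructField, StructType)

    schema = StructType([
        StructField("stuff", ArrayType(MapType(StringType(), LongType())))
    ])
    df = sqlContext.createDataFrame([([{"a": 1, "b": 2, "c": 3}],)], schema)
    df.registerTempTable("df")

    sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+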
    

See also Querying Spark SQL DataFrame with complex types

zero323 answered Oct 01 '22