Edit: this is an old question concerning Spark 1.2.
I've been trying to compute, on the fly, the length of a string column in a SchemaRDD for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, or about knowing their limitations.
My first attempt has been to use the integrated relational queries, for instance
notes.select('note).orderBy(length('note))
with no luck at compilation:
error: not found: value length
(Which makes me wonder where to find out which "Expression"s this DSL can actually resolve; it does resolve "+" for column addition, for instance.)
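As a stopgap I can leave the DSL entirely and sort at the plain RDD level, since a SchemaRDD is also an RDD[Row]. A minimal sketch (assuming notes has the string column note at position 0):

val sortedNotes = notes.map(_.getString(0))  // extract the 'note column as a String
                       .sortBy(_.length)     // RDD.sortBy, not a Spark SQL expression

But that throws away the schema, which is what I'd like to avoid.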
Then I tried the SQL interface:
sql("SELECT note, length(note) as len FROM notes")
This fails with
java.util.NoSuchElementException: key not found: length
(I then reread the Supported Hive Features section of the programming guide for 1.2.0, which I am running, http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features, and wondered in what sense Spark SQL supports the listed Hive features.)
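For reference, my session looks roughly like the following (a reconstruction; the context creation is my assumption, but the error suggests a plain SQLContext whose function registry does not know length):

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._  // brings sql(...) and createSchemaRDD into scope
notes.registerTempTable("notes")
sql("SELECT note, length(note) as len FROM notes")  // -> key not found: length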
Questions: is the length operator really supported in Expressions and/or in SQL statements? If so, what is the syntax? (Bonus: is there specific documentation about what is resolved in Spark SQL Expressions, and what the syntax is in general?)
Thanks!
From the Spark SQL built-in function reference: character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros.
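In later Spark versions (2.x and up) this is available out of the box, both as a DSL function and in SQL. A minimal sketch; note that the SparkSession API shown here did not exist in 1.2:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val notes = Seq("One", "Two", "Three").toDF("note")
notes.orderBy(length(col("note"))).show()  // DSL form
notes.createOrReplaceTempView("notes")
spark.sql("SELECT note, length(note) AS len FROM notes ORDER BY len").show()  // SQL form

On 1.2 itself, the route below via a HiveContext is what works.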
Try this in Spark Shell:
case class Note(id: Int, text: String)

// Build an RDD of sample rows
val notes = List(Note(1, "One"), Note(2, "Two"), Note(3, "Three"))
val notesRdd = sc.parallelize(notes)

// A HiveContext makes Hive UDFs such as length() available in SQL
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
import hc.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

notesRdd.registerTempTable("note")
hc.sql("select id, text, length(text) from note").foreach(println)
It works on my setup (out-of-the-box Spark 1.2.1 with Hadoop 2.4):
[2,Two,3]
[1,One,3]
[3,Three,5]
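And the orderBy the question was after works through the same Hive UDF; extending the snippet above:

// Sort by the computed length, as originally intended
hc.sql("select text from note order by length(text)").foreach(println)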