compute string length in Spark SQL DSL

Edit: this is an old question concerning Spark 1.2

I've been trying to compute the length of a string column in a SchemaRDD on the fly, for orderBy purposes. I am learning Spark SQL, so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, and about knowing their limitations.

My first attempt has been to use the integrated relational queries, for instance

notes.select('note).orderBy(length('note))

with no luck at the compilation:

error: not found: value length

(Which makes me wonder where to find out which Expressions this DSL can actually resolve. For instance, it resolves + for column addition.)

Then I tried

sql("SELECT note, length(note) as len FROM notes")

This fails with

java.util.NoSuchElementException: key not found: length

(I then reread the supported Hive features section of the programming guide (I'm running 1.2.0), http://spark.apache.org/docs/1.2.0/sql-programming-guide.html#supported-hive-features, and wondered in what sense Spark SQL supports the listed Hive features.)

Questions: is the length operator really supported in Expressions and/or in SQL statements? If yes, what is the syntax? (Bonus: is there specific documentation about what is resolved in Spark SQL Expressions, and what is the syntax in general?)

Thanks!

Wilmerton asked Feb 16 '15

People also ask

How do I find the length of a string in Spark SQL?

character_length(expr) - Returns the character length of string data or the number of bytes of binary data. The length of string data includes trailing spaces. The length of binary data includes binary zeros.
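
As a quick illustration, assuming a Spark version recent enough to ship character_length (the spark session variable below is the shell default, not something from the original post):

// Minimal sketch: character_length and its length alias agree on plain strings
spark.sql("SELECT character_length('Three') AS chars, length('Three') AS len").show()
// +-----+---+
// |chars|len|
// +-----+---+
// |    5|  5|
// +-----+---+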

How do I find the length of my Spark data frame?

Similar to Python Pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
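
A minimal Scala equivalent of that Pandas-style shape check (df is any DataFrame; the variable names are illustrative, not from the original post):

// Sketch: the row count is an action, the column count is local metadata
val numRows = df.count()
val numCols = df.columns.length
println(s"shape: ($numRows, $numCols)")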


1 Answer

Try this in Spark Shell:

case class Note(id: Int, text: String)
val notes = List(Note(1, "One"), Note(2, "Two"), Note(3, "Three"))
val notesRdd = sc.parallelize(notes)
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
import hc.createSchemaRDD  // implicit RDD -> SchemaRDD conversion (Spark 1.2)
notesRdd.registerTempTable("note")
// length() resolves here because HiveContext brings in the Hive UDFs
hc.sql("select id, text, length(text) from note").foreach(println)

It works on my setup (out-of-the-box Spark 1.2.1 with Hadoop 2.4):

[2,Two,3]
[1,One,3]
[3,Three,5]
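
For reference, later Spark releases (1.5 and up) expose a length function in org.apache.spark.sql.functions, so the DSL-style attempt from the question works directly; a minimal sketch, assuming notes is a DataFrame with a string column named note:

// Sketch for later Spark versions (DataFrame API, not the 1.2 SchemaRDD DSL)
import org.apache.spark.sql.functions.{col, length}
notes.select(col("note")).orderBy(length(col("note")))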
G Quintana answered Sep 20 '22