Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split comma separated string and get n values in Spark Scala dataframe?

How to take only 2 data from arraytype column in Spark Scala? I got the data like val df = spark.sqlContext.sql("select col1, col2 from test_tbl").

I have data like following:

col1  | col2                              
---   | ---
a     | [test1,test2,test3,test4,.....]   
b     | [a1,a2,a3,a4,a5,.....]       

I want to get data like following:

col1| col2
----|----
a   | test1,test2
b   | a1,a2

When I am doing df.withColumn("test", col("col2").take(5)) it is not working. It give this error:

value take is not a member of org.apache.spark.sql.ColumnName

How can I get the data in above order?

like image 668
Narendra Mohan Prasad Avatar asked Jul 13 '17 17:07

Narendra Mohan Prasad


People also ask

What is Rlike in Scala?

RLIKE searches a string for a regular expression pattern.

How do you separate comma Separated Values in Pyspark?

In pyspark SQL, the split() function converts the delimiter separated String to an Array. It is done by splitting the string based on delimiters like spaces, commas, and stack them into an array. This function returns pyspark. sql.

What is Rlike in spark?

rlike() function can be used to derive a new Spark/PySpark DataFrame column from an existing column, filter data by matching it with regular expressions, use with conditions, and many more.


1 Answers

Inside withColumn you can call udf getPartialstring for that you can use slice or take method like below example snippet untested.

  import sqlContext.implicits._
  import org.apache.spark.sql.functions._

  val getPartialstring = udf((array : Seq[String], fromIndex : Int, toIndex : Int) 
   => array.slice(fromIndex ,toIndex ).mkString(",")) 

your caller will appear like

 df.withColumn("test",getPartialstring(col("col2"))

col("col2").take(5) is failing because column doesn't have a method take(..) that's why your error message says

error: value take is not a member of org.apache.spark.sql.ColumnName

You can use udf approach to tackle this.

like image 100
Ram Ghadiyaram Avatar answered Sep 28 '22 11:09

Ram Ghadiyaram