
Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+

I want to exclude the records where the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.

So the output should look like this:

+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+

Can anyone please help with this?

I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them instead?

Version: Spark 1.6.2, Scala 2.10

Ramesh asked Mar 22 '17 12:03



People also ask

How do you filter records from a DataFrame in Spark?

Spark's filter() or where() function filters the rows of a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use where() instead of filter() if you are coming from a SQL background. The two functions behave identically.
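
As a small illustration of that equivalence, the sketch below applies one condition through filter() and where(), once as a Column expression and once as a SQL string. It assumes a Spark 1.6 spark-shell session, where `sc` and `sqlContext` are predefined; the value names (`viaFilter`, `viaWhere`, `viaSqlString`) are made up for this example.

```scala
// Assumption: run inside spark-shell (Spark 1.6.x), which predefines `sc` and `sqlContext`.
import sqlContext.implicits._

// Same sample data as in the question.
val df = sc.parallelize(Seq(
  (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
  (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
)).toDF("c1", "c2")

// One condition, three spellings: keep rows whose c2 starts with "MSL".
val viaFilter    = df.filter($"c2".startsWith("MSL")) // Column expression with filter()
val viaWhere     = df.where($"c2".startsWith("MSL"))  // same expression with where()
val viaSqlString = df.where("c2 like 'MSL%'")         // SQL expression string

viaFilter.show()
viaWhere.show()
viaSqlString.show()
```

All three should return the same two rows (c1 = 6 and c1 = 7).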

What is the function of filter() in Spark?

In Spark, the filter function returns a new dataset formed by selecting those elements of the source on which the given function returns true. In other words, it retrieves only the elements that satisfy the condition.
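
To make the "function returns true" wording concrete, here is a rough sketch of the same idea on a plain RDD, where the condition is an ordinary Scala predicate. It assumes a spark-shell session with `sc` predefined; `records` and `kept` are illustrative names only.

```scala
// Assumption: run inside spark-shell, which predefines `sc`.
// Same sample data as in the question, kept as an RDD of (Int, String) pairs.
val records = sc.parallelize(Seq(
  (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
  (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
))

// The predicate returns true for elements to keep:
// here, everything whose second field does not start with "MSL" or "HCP".
val kept = records.filter { case (_, c2) =>
  !(c2.startsWith("MSL") || c2.startsWith("HCP"))
}

kept.collect().foreach(println)
```

Negating the predicate is all it takes to switch from selecting records to excluding them.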


1 Answer

This works too. Concise and very similar to SQL.

df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
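
The same exclusion can probably also be written with the Column API, which is closer to the rlike attempt in the question. A minimal sketch, assuming a Spark 1.6 spark-shell session (`sc` and `sqlContext` predefined); `byRegex` and `byPrefix` are names chosen for this example. The regex is anchored with `^` because rlike otherwise matches anywhere in the string, not only in the first 3 characters.

```scala
// Assumption: run inside spark-shell (Spark 1.6.x), which predefines `sc` and `sqlContext`.
import sqlContext.implicits._

// Same sample data as in the question.
val df = sc.parallelize(Seq(
  (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
  (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
)).toDF("c1", "c2")

// Variant 1: negate an anchored regex with the Column `!` operator.
val byRegex = df.filter(!($"c2".rlike("^(MSL|HCP)")))

// Variant 2: negate a pair of prefix checks.
val byPrefix = df.filter(!($"c2".startsWith("MSL") || $"c2".startsWith("HCP")))

byRegex.show()
byPrefix.show()
```

Both variants should keep the rows with c1 from 1 to 5. As with the SQL string version, rows where c2 is null would be dropped, since the negated condition evaluates to null rather than true.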
Jegan answered Oct 01 '22 09:10
