 

Filter spark DataFrame on string contains

I am using Spark 1.3.0 and Spark Avro 1.0.0, working from the example on the repository page. The following code works well:

val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")

But what if I need to check whether the doctor string contains a substring? Since the filter expression is written inside a string, how do I express a "contains"?

Knows Not Much asked Mar 02 '16 22:03


People also ask

How do I filter string values in PySpark?

In Spark and PySpark, the contains() function matches when a column value contains a literal string (it matches on part of the string); it is mostly used to filter DataFrame rows.

How do I find a substring in PySpark?

DataFrame.withColumn(colName, col) can be used to extract a substring from column data, together with PySpark's substring() function.

How do I filter rows in Spark DataFrame?

Spark's filter() or where() function is used to filter the rows of a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use where() instead of filter() if you are coming from a SQL background; both functions behave exactly the same.

How do you use like on Spark?

In Spark and PySpark, the like() function is similar to the SQL LIKE operator: it matches on the wildcard characters % (percentage) and _ (underscore) to filter rows. You can use it to filter DataFrame rows on single or multiple conditions, or to derive a new column by using it inside when().


2 Answers

You can use contains (this matches on an arbitrary substring):

df.filter($"foo".contains("bar")) 

like (SQL LIKE with SQL simple regular expressions, with _ matching an arbitrary character and % matching an arbitrary sequence):

df.filter($"foo".like("bar")) 

or rlike (like with Java regular expressions):

df.filter($"foo".rlike("bar")) 

depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.

zero323 answered Sep 25 '22 01:09


In PySpark's Spark SQL syntax:

where column_n like 'xyz%' 

might not work.

Use:

where column_n RLIKE '^xyz'  

This works perfectly fine.

Sam91 answered Sep 23 '22 01:09