 

Filter spark DataFrame on string contains

I am using Spark 1.3.0 and Spark Avro 1.0.0, working from the example on the repository page. The following code works well:

val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")

But what if I need to check whether the doctor string contains a substring? Since the filter expression is written inside a string, how do I express a "contains"?

Knows Not Much asked Mar 02 '16 22:03


People also ask

How do I filter string values in PySpark?

In Spark and PySpark, the contains() function matches when a column value contains a literal string (it matches on part of the string); it is mostly used to filter DataFrame rows.

How do I find a substring in PySpark?

DataFrame.withColumn(colName, col) can be used to extract a substring from column data, together with PySpark's substring() function.

How do I filter rows in Spark DataFrame?

Spark's filter() or where() function is used to filter the rows of a DataFrame or Dataset based on one or more conditions or a SQL expression. You can use where() instead of filter() if you are coming from a SQL background; both functions behave exactly the same.

How do you use like on Spark?

In Spark and PySpark, the like() function is similar to the SQL LIKE operator: it matches on the wildcard characters % (percentage) and _ (underscore) to filter rows. You can use it to filter DataFrame rows on single or multiple conditions, or to derive a new column by using it inside when().


2 Answers

You can use contains (this matches on an arbitrary substring):

df.filter($"foo".contains("bar")) 

like (SQL LIKE with SQL simple regular expressions, with _ matching an arbitrary character and % matching an arbitrary sequence):

df.filter($"foo".like("bar")) 

or rlike (like with Java regular expressions):

df.filter($"foo".rlike("bar")) 

depending on your requirements. LIKE and RLIKE should work with SQL expressions as well.

zero323 answered Sep 25 '22 01:09


In PySpark's Spark SQL syntax:

where column_n like 'xyz%' 

might not work.

Use:

where column_n RLIKE '^xyz'  

This works perfectly fine.

Sam91 answered Sep 23 '22 01:09