I am using Spark 1.3.0 and Spark Avro 1.0.0. I am working from the example on the repository page. This following code works well
val df = sqlContext.read.avro("src/test/resources/episodes.avro") df.filter("doctor > 5").write.avro("/tmp/output")
But what if I needed to see if the doctor
string contains a substring? Since we are writing our expression inside of a string. What do I do to do a "contains"?
In Spark & PySpark, contains() function is used to match a column value contains in a literal string (matches on part of the string), this is mostly used to filter rows on DataFrame.
The DataFrame. withColumn(colName, col) can be used for extracting substring from the column data by using pyspark's substring() function along with it.
Spark filter() or where() function is used to filter the rows from DataFrame or Dataset based on the given one or multiple conditions or SQL expression. You can use where() operator instead of the filter if you are coming from SQL background. Both these functions operate exactly the same.
In Spark & PySpark like() function is similar to SQL LIKE operator that is used to match based on wildcard characters (percentage, underscore) to filter the rows. You can use this function to filter the DataFrame rows by single or multiple conditions, to derive a new column, use it on when().
You can use contains
(this works with an arbitrary sequence):
df.filter($"foo".contains("bar"))
like
(SQL like with SQL simple regular expression whith _
matching an arbitrary character and %
matching an arbitrary sequence):
df.filter($"foo".like("bar"))
or rlike
(like with Java regular expressions):
df.filter($"foo".rlike("bar"))
depending on your requirements. LIKE
and RLIKE
should work with SQL expressions as well.
In pyspark,SparkSql syntax:
where column_n like 'xyz%'
might not work.
Use:
where column_n RLIKE '^xyz'
This works perfectly fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With