
How do I use "not rlike" in spark-sql?

rlike works fine, but "not rlike" throws an error:

scala> sqlContext.sql("select * from T where columnB rlike '^[0-9]*$'").collect()
res42: Array[org.apache.spark.sql.Row] = Array([412,0], [0,25], [412,25], [0,25])

scala> sqlContext.sql("select * from T where columnB not rlike '^[0-9]*$'").collect()
java.lang.RuntimeException: [1.35] failure: ``in'' expected but `rlike' found


val df = sc.parallelize(Seq(
  (412, 0),
  (0, 25), 
  (412, 25), 
  (0, 25)
)).toDF("columnA", "columnB")

Or is this a continuation of the issue https://issues.apache.org/jira/browse/SPARK-4207?

WoodChopper asked Dec 30 '15 17:12


2 Answers

A concise way to do it in PySpark is:

df.filter(~df.column.rlike(pattern))
pleicht17 answered Oct 05 '22 10:10


There is no such thing as "not rlike", but regex has a feature called negative lookahead, which lets a pattern succeed only where a given sub-pattern does not match.

For the query above, you can negate inside the regex. Say you want only the rows where columnB contains none of the digits 1 to 9. Then you can write:

sqlContext.sql("select * from T where columnB rlike '^(?!.*[1-9]).*$'").collect() 
Result: Array[org.apache.spark.sql.Row] = Array([412,0])

What I mean overall is that you have to negate the match in the regex itself, not with rlike. rlike simply applies the regex you give it: if the regex is written to reject a pattern, rlike rejects it; if it is written to match a pattern, rlike matches it.
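
To illustrate the point outside Spark, here is a minimal sketch in plain Python that uses re.search as a rough stand-in for rlike. The helper name rlike, the lookahead pattern ^(?![0-9]*$), and the extra sample values "abc", "12a" and the empty string are assumptions added for illustration; they are not from the question. Spark evaluates rlike with Java regular expressions, so treat this as an approximation that should behave the same for patterns this simple.

```python
import re

# columnB values as strings. "0" and "25" come from the question;
# "abc", "12a" and "" are hypothetical extras so the negated filter
# has something to return.
values = ["0", "25", "abc", "12a", ""]

DIGITS_ONLY = r"^[0-9]*$"
# The same condition negated inside the regex with a negative lookahead:
# succeed at the start only if the rest of the string is NOT all digits.
NOT_DIGITS_ONLY = r"^(?![0-9]*$)"
# Pattern from the answer above: the value contains none of the digits 1-9.
NO_DIGITS_1_TO_9 = r"^(?!.*[1-9]).*$"


def rlike(value, pattern):
    """Rough stand-in for rlike: True if the pattern is found in value."""
    return re.search(pattern, value) is not None


# Negate the result of the match (what "not rlike" was meant to do).
negated_outside = [v for v in values if not rlike(v, DIGITS_ONLY)]
# Negate inside the regex instead, then match normally.
negated_inside = [v for v in values if rlike(v, NOT_DIGITS_ONLY)]
# Apply the answer's pattern.
no_1_to_9 = [v for v in values if rlike(v, NO_DIGITS_1_TO_9)]

print(negated_outside)  # ['abc', '12a']
print(negated_inside)   # ['abc', '12a']
print(no_1_to_9)        # ['0', 'abc', '']
```

Both ways of negating select the same values here. Note that the answer's pattern also accepts the empty string and non-numeric text such as "abc", because it only rules out the digits 1 to 9.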

Srini answered Oct 05 '22 10:10