SQL LIKE in Spark SQL

I'm trying to implement a join in Spark SQL using a LIKE condition.

The column I am performing the join on is called 'revision'; the values look like this:

Table A:

8NXDPVAE

Table B:

[4,8]NXD_V%

Performing the join on SQL Server (A.revision LIKE B.revision) works just fine, but doing the same in Spark SQL returns no rows (with an inner join) or null values for Table B (with an outer join).

This is the query I am running:

val joined = spark.sql("SELECT A.revision, B.revision FROM RAWDATA A LEFT JOIN TPTYPE B ON A.revision LIKE B.revision")

The plan looks like this:

== Physical Plan ==
BroadcastNestedLoopJoin BuildLeft, LeftOuter, revision#15 LIKE revision#282, false
:- BroadcastExchange IdentityBroadcastMode
:  +- *Project [revision#15]
:     +- *Scan JDBCRelation(RAWDATA) [revision#15] PushedFilters: [EqualTo(bulk_id,2016092419270100198)], ReadSchema: struct<revision>
+- *Scan JDBCRelation(TPTYPE) [revision#282] ReadSchema: struct<revision>

Is it possible to perform a LIKE join like this or am I way off?

Dan Markhasin asked Nov 06 '16



1 Answer

You are only a little bit off. Spark SQL and Hive follow SQL standard conventions, where the LIKE operator accepts only two special characters:

  • _ (underscore) - matches exactly one arbitrary character.
  • % (percent) - matches an arbitrary sequence of characters (including an empty one).
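
For example, a pattern built only from those wildcards does match the value from your question (a quick check you can run in spark-shell):

// _ matches the single characters '8' and 'P', % matches the trailing 'AE', so this returns true
spark.sql("SELECT '8NXDPVAE' LIKE '_NXD_V%'").show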

Square brackets have no special meaning and [4,8] matches only a [4,8] literal:

spark.sql("SELECT '[4,8]' LIKE '[4,8]'").show
+----------------+
|[4,8] LIKE [4,8]|
+----------------+
|            true|
+----------------+
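This is why your join produces no matches: [4,8]NXD_V% is compared character-for-character against 8NXDPVAE, so the predicate is never true:

// The first five characters would have to be exactly "[4,8]", so this returns false
spark.sql("SELECT '8NXDPVAE' LIKE '[4,8]NXD_V%'").show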

To match complex patterns you can use the RLIKE operator, which supports Java regular expressions:

spark.sql("SELECT '8NXDPVAE' RLIKE '^[4,8]NXD.V.*$'").show
+-----------------------------+
|8NXDPVAE RLIKE ^[4,8]NXD.V.*$|
+-----------------------------+
|                         true|
+-----------------------------+
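
Applied to your join, the cleanest option is to store the patterns in TPTYPE as Java regular expressions and use RLIKE directly. If you want to keep the SQL Server-style patterns, here is a minimal sketch that rewrites them on the fly, assuming B.revision only uses [...], _ and % as pattern syntax (the character class is already valid regex and passes through unchanged, _ becomes . and % becomes .*):

val joined = spark.sql("""
  SELECT A.revision, B.revision
  FROM RAWDATA A
  LEFT JOIN TPTYPE B
    ON A.revision RLIKE concat(
         '^',
         regexp_replace(regexp_replace(B.revision, '_', '.'), '%', '.*'),
         '$')
""")

Like your original query, this is still a non-equi join, so you should expect a BroadcastNestedLoopJoin in the physical plan.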
zero323 answered Oct 20 '22