I'm trying to implement a join in Spark SQL using a LIKE condition.
The column I am joining on is called revision, and the data looks like this:
Table A:
8NXDPVAE
Table B:
[4,8]NXD_V%
Performing the join on SQL Server (A.revision LIKE B.revision) works just fine, but when doing the same in Spark SQL, the join returns no rows (with an inner join) or null values for Table B (with an outer join).
This is the query I am running:
val joined = spark.sql("SELECT A.revision, B.revision FROM RAWDATA A LEFT JOIN TPTYPE B ON A.revision LIKE B.revision")
The plan looks like this:
== Physical Plan ==
BroadcastNestedLoopJoin BuildLeft, LeftOuter, revision#15 LIKE revision#282, false
:- BroadcastExchange IdentityBroadcastMode
: +- *Project [revision#15]
: +- *Scan JDBCRelation(RAWDATA) [revision#15] PushedFilters: [EqualTo(bulk_id,2016092419270100198)], ReadSchema: struct<revision>
+- *Scan JDBCRelation(TPTYPE) [revision#282] ReadSchema: struct<revision>
Is it possible to perform a LIKE join like this or am I way off?
You are only a little bit off. Spark SQL and Hive follow SQL standard conventions, where the LIKE operator accepts only two special characters:

_ (underscore) - matches an arbitrary single character.
% (percent) - matches an arbitrary sequence of characters.

Square brackets have no special meaning, so [4,8] matches only the literal string [4,8]:
spark.sql("SELECT '[4,8]' LIKE '[4,8]'").show
+----------------+
|[4,8] LIKE [4,8]|
+----------------+
| true|
+----------------+
To match complex patterns you can use the RLIKE operator, which supports Java regular expressions:
spark.sql("SELECT '8NXDPVAE' RLIKE '^[4,8]NXD.V.*$'").show
+-----------------------------+
|8NXDPVAE RLIKE ^[4,8]NXD.V.*$|
+-----------------------------+
| true|
+-----------------------------+
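If you have many SQL Server-style patterns, you can translate them mechanically into Java regexes for use with RLIKE. A minimal sketch, assuming well-formed patterns (likeToRegex is a hypothetical helper written here for illustration, not a Spark API):

```scala
// Translate a SQL Server LIKE pattern into an anchored Java regex:
// keep [...] character classes as-is, map _ -> . and % -> .*,
// and escape regex metacharacters in the remaining literals.
object LikeToRegex {
  def likeToRegex(pattern: String): String = {
    val sb = new StringBuilder("^")
    var i = 0
    while (i < pattern.length) {
      pattern.charAt(i) match {
        case '_' => sb.append('.')              // single-character wildcard
        case '%' => sb.append(".*")             // multi-character wildcard
        case '[' =>                             // copy the class verbatim
          val close = pattern.indexOf(']', i)   // assumes a matching ']'
          sb.append(pattern.substring(i, close + 1))
          i = close
        case c if "\\^$.|?*+()".contains(c) =>  // escape regex specials
          sb.append('\\').append(c)
        case c => sb.append(c)
      }
      i += 1
    }
    sb.append('$').toString
  }

  def main(args: Array[String]): Unit = {
    val re = likeToRegex("[4,8]NXD_V%")
    println(re)                        // ^[4,8]NXD.V.*$
    println("8NXDPVAE".matches(re))    // true
  }
}
```

One option is to store the translated regex in a column of TPTYPE and join with ON A.revision RLIKE B.revision_regex; note that a non-equi join like this still compiles to a BroadcastNestedLoopJoin, as in your plan.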