We are using the PySpark libraries interfacing with Spark 1.3.1.
We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to JOIN the two dataframes and return a resulting dataframe with {document_id, keyword} pairs, using the criterion that keywords_df.keyword appears in the documents_df.document_text string.
In PostgreSQL, for example, we could achieve this using an ON clause of the form:
documents_df.document_text ilike '%' || keywords_df.keyword || '%'
In PySpark, however, I cannot get any form of join syntax to work. Has anybody achieved something like this before?
With kind regards,
Will
It is possible in two different ways, but generally speaking it is not recommended. First, let's create some dummy data:
from pyspark.sql import Row
document_row = Row("document_id", "document_text")
keyword_row = Row("keyword")
documents_df = sc.parallelize([
    document_row(1L, "apache spark is the best"),
    document_row(2L, "erlang rocks"),
    document_row(3L, "but haskell is better")
]).toDF()

keywords_df = sc.parallelize([
    keyword_row("erlang"),
    keyword_row("haskell"),
    keyword_row("spark")
]).toDF()
Hive UDFs
documents_df.registerTempTable("documents")
keywords_df.registerTempTable("keywords")
query = """SELECT document_id, keyword
    FROM documents JOIN keywords
    ON document_text LIKE CONCAT('%', keyword, '%')"""
like_with_hive_udf = sqlContext.sql(query)
like_with_hive_udf.show()
## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## | 1| spark|
## | 2| erlang|
## | 3|haskell|
## +-----------+-------+
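As an aside, on newer Spark releases (roughly 1.5+, so not the 1.3.1 from the question) the same predicate can be written inline against the DataFrame API with sql.functions.expr instead of registering temp tables; the physical plan is still a Cartesian product, so the caveat discussed below applies just the same. A sketch:
from pyspark.sql.functions import expr

# same LIKE predicate expressed directly as a join condition (Spark 1.5+)
like_with_expr = (documents_df
    .join(keywords_df, expr("document_text LIKE CONCAT('%', keyword, '%')"))
    .select("document_id", "keyword"))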
Python UDF
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType
# Or you can replace `in` with a regular expression
contains = udf(lambda s, q: q in s, BooleanType())
like_with_python_udf = (documents_df.join(keywords_df)
    .where(contains(col("document_text"), col("keyword")))
    .select(col("document_id"), col("keyword")))
like_with_python_udf.show()
## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## | 1| spark|
## | 2| erlang|
## | 3|haskell|
## +-----------+-------+
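If plain substring containment is too loose (for example "spark" would also match "sparkling"), the comment above hints that `in` can be swapped for a regular expression. A minimal word-boundary sketch (the contains_word name is just illustrative; this changes only the predicate, not the join strategy, so the caveat below still applies):
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# whole-word match instead of a raw substring test
contains_word = udf(
    lambda s, q: re.search(r"\b" + re.escape(q) + r"\b", s) is not None,
    BooleanType())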
Why is it not recommended? Because in both cases it requires a Cartesian product:
like_with_hive_udf.explain()
## TungstenProject [document_id#2L,keyword#4]
## Filter document_text#3 LIKE concat(%,keyword#4,%)
## CartesianProduct
## Scan PhysicalRDD[document_id#2L,document_text#3]
## Scan PhysicalRDD[keyword#4]
like_with_python_udf.explain()
## TungstenProject [document_id#2L,keyword#4]
## Filter pythonUDF#13
## !BatchPythonEvaluation PythonUDF#<lambda>(document_text#3,keyword#4), ...
## CartesianProduct
## Scan PhysicalRDD[document_id#2L,document_text#3]
## Scan PhysicalRDD[keyword#4]
There are other ways to achieve a similar effect without a full Cartesian product.
Join on the tokenized document - useful if the keywords list is too large to be handled in the memory of a single machine
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import explode
tokenizer = Tokenizer(inputCol="document_text", outputCol="words")
tokenized = (tokenizer.transform(documents_df)
    .select(col("document_id"), explode(col("words")).alias("token")))

like_with_tokenizer = (tokenized
    .join(keywords_df, col("token") == col("keyword"))
    .drop("token"))
like_with_tokenizer.show()
## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## | 3|haskell|
## | 1| spark|
## | 2| erlang|
## +-----------+-------+
This requires a shuffle but not a Cartesian product:
like_with_tokenizer.explain()
## TungstenProject [document_id#2L,keyword#4]
## SortMergeJoin [token#29], [keyword#4]
## TungstenSort [token#29 ASC], false, 0
## TungstenExchange hashpartitioning(token#29)
## TungstenProject [document_id#2L,token#29]
## !Generate explode(words#27), true, false, [document_id#2L, ...
## ConvertToSafe
## TungstenProject [document_id#2L,UDF(document_text#3) AS words#27]
## Scan PhysicalRDD[document_id#2L,document_text#3]
## TungstenSort [keyword#4 ASC], false, 0
## TungstenExchange hashpartitioning(keyword#4)
## ConvertToUnsafe
## Scan PhysicalRDD[keyword#4]
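One caveat: the whitespace-based Tokenizer only matches keywords that appear as standalone tokens, so a keyword followed by punctuation (e.g. "spark,") would be missed. If that matters, a RegexTokenizer (available from roughly Spark 1.4 onwards; the pattern chosen here is an assumption) can be dropped in instead:
from pyspark.ml.feature import RegexTokenizer
from pyspark.sql.functions import col, explode

# drop-in replacement for the whitespace Tokenizer above;
# with gaps=True the pattern acts as a delimiter, so "spark," still yields "spark"
regex_tokenizer = RegexTokenizer(
    inputCol="document_text", outputCol="words",
    gaps=True, pattern="\\W+")

regex_tokenized = (regex_tokenizer.transform(documents_df)
    .select(col("document_id"), explode(col("words")).alias("token")))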
Python UDF and broadcast variable - useful if the keywords list is relatively small
from pyspark.sql.types import ArrayType, StringType
keywords = sc.broadcast(set(
    keywords_df.map(lambda row: row[0]).collect()))

bd_contains = udf(
    lambda s: list(set(s.split()) & keywords.value),
    ArrayType(StringType()))

like_with_bd = (documents_df.select(
    col("document_id"),
    explode(bd_contains(col("document_text"))).alias("keyword")))
like_with_bd.show()
## +-----------+-------+
## |document_id|keyword|
## +-----------+-------+
## | 1| spark|
## | 2| erlang|
## | 3|haskell|
## +-----------+-------+
It requires neither a shuffle nor a Cartesian product, but you still have to transfer the broadcast variable to each worker node.
like_with_bd.explain()
## TungstenProject [document_id#2L,keyword#46]
## !Generate explode(pythonUDF#47), true, false, ...
## ConvertToSafe
## TungstenProject [document_id#2L,pythonUDF#47]
## !BatchPythonEvaluation PythonUDF#<lambda>(document_text#3), ...
## Scan PhysicalRDD[document_id#2L,document_text#3]
Since Spark 1.6.0 you can mark a small DataFrame using sql.functions.broadcast
to get a similar effect to the above without using UDFs and explicit broadcast variables. Reusing the tokenized data:
from pyspark.sql.functions import broadcast
like_with_tokenizer_and_bd = (broadcast(tokenized)
    .join(keywords_df, col("token") == col("keyword"))
    .drop("token"))
like_with_tokenizer_and_bd.explain()
## TungstenProject [document_id#3L,keyword#5]
## BroadcastHashJoin [token#10], [keyword#5], BuildLeft
## TungstenProject [document_id#3L,token#10]
## !Generate explode(words#8), true, false, ...
## ConvertToSafe
## TungstenProject [document_id#3L,UDF(document_text#4) AS words#8]
## Scan PhysicalRDD[document_id#3L,document_text#4]
## ConvertToUnsafe
## Scan PhysicalRDD[keyword#5]
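The hint works just as well the other way around: since keywords_df is the small side here, broadcasting it instead of the tokenized documents is arguably the more natural pattern (a sketch reusing the names from above; it should produce a similar BroadcastHashJoin, just built from the keywords side):
# broadcast the small keywords side instead of the tokenized documents
like_with_bd_keywords = (tokenized
    .join(broadcast(keywords_df), col("token") == col("keyword"))
    .drop("token"))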