 

How to do left outer join in spark sql?

I am trying to do a left outer join in spark (1.6.2) and it doesn't work. My sql query is like this:

sqlContext.sql("""
    select t.type, t.uuid, p.uuid
    from symptom_type t
    LEFT JOIN plugin p ON t.uuid = p.uuid
    where t.created_year = 2016
      and p.created_year = 2016
""").show()

The result is like this:

+--------------------+--------------------+--------------------+
|                type|                uuid|                uuid|
+--------------------+--------------------+--------------------+
|              tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
|             swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|

I get the same result whether I use LEFT JOIN or LEFT OUTER JOIN (the second uuid is never null).

I would expect the second uuid column to be null for rows without a match. How do I do a left outer join correctly?

=== Additional information ===

If I use the DataFrame API to do the left outer join, I get the correct result.

s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')

s.join(p, s.uuid == p.uuid, 'left_outer') \
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month).show()

I get a result like this:

+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|               type|              s_uuid|           p_uuid|        created_date|created_year|created_month|
+-------------------+--------------------+-----------------+--------------------+------------+-------------+
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|
|             tained|6d688688-96a4-341...|             null|2016-01-28 00:27:...|        null|         null|

Thanks,

asked Nov 04 '16 by Sean Nguyen


People also ask

How do you do a left join in Spark?

Spark SQL's left outer join (left, left outer, left_outer) returns all rows from the left DataFrame regardless of whether a match is found on the right DataFrame. When the join expression doesn't match, it fills the right-side columns with null for that record, and it drops right-side records with no match.
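That rule can be sketched in plain Python (sample rows are invented; this is an analogue of the join semantics, not Spark itself):

```python
# Invented sample data standing in for a left table and a right-side key set.
left = [("tained", "u1"), ("swapper", "u2")]
right = {"u1"}  # only u1 has a matching row on the right

# Left outer join semantics: every left row survives; an unmatched
# row is padded with None instead of being dropped.
joined = [(t, u, u if u in right else None) for (t, u) in left]

print(joined)
# [('tained', 'u1', 'u1'), ('swapper', 'u2', None)]
```

The key point is that the left side controls the row count; the right side only decides whether the padded columns are real values or null.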

How do you do an outer join in Spark SQL?

In order to use a full outer join on a Spark DataFrame, you can use outer, full, or fullouter as the join type. In the emp/dept example, emp_dept_id with value 60 doesn't have a record in dept, hence the dept columns are null; and dept_id 30 doesn't have a record in emp, hence you see nulls in the emp columns.
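A pure-Python sketch of full outer join semantics (the keyed rows below are invented, not the emp/dept data):

```python
# Invented keyed rows for a left and a right side.
left = {"e1": "emp1", "e2": "emp2"}
right = {"e2": "dept2", "e3": "dept3"}

# Full outer join: take the union of keys from both sides;
# whichever side has no row for a key contributes None.
keys = sorted(set(left) | set(right))
joined = [(k, left.get(k), right.get(k)) for k in keys]

print(joined)
# [('e1', 'emp1', None), ('e2', 'emp2', 'dept2'), ('e3', None, 'dept3')]
```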

How do I join a table in Spark?

In order to explain join with multiple tables, we will use Inner join, this is the default join in Spark and it's mostly used, this joins two DataFrames/Datasets on key columns, and where keys don't match the rows get dropped from both datasets.

How do joins work in Spark?

In order to join data, Spark needs the rows that are to be joined (i.e., the rows sharing each key) to live on the same partition. The default implementation of a join in Spark is a shuffled hash join.
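A minimal sketch of that co-location idea, using invented integer-keyed rows and a toy modulo partitioner (real Spark uses its own hash partitioner and distributed shuffle machinery):

```python
# Invented rows from two datasets, keyed by an integer id.
left = [(1, "a"), (2, "b"), (5, "c")]
right = [(1, "x"), (5, "y"), (7, "z")]

NUM_PARTITIONS = 2

def shuffle(rows, n=NUM_PARTITIONS):
    """Route each row to a partition by key, so equal keys co-locate."""
    parts = [[] for _ in range(n)]
    for k, v in rows:
        parts[k % n].append((k, v))  # same key -> same partition index
    return parts

lp, rp = shuffle(left), shuffle(right)

# The join then runs independently per partition: matching keys are
# guaranteed to be in the same partition on both sides.
joined = []
for i in range(NUM_PARTITIONS):
    rmap = dict(rp[i])  # hash side of the per-partition hash join
    for k, v in lp[i]:
        if k in rmap:
            joined.append((k, v, rmap[k]))

print(joined)
# [(1, 'a', 'x'), (5, 'c', 'y')]
```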


2 Answers

I don't see any issues in your code. Both "left join" and "left outer join" will work fine. Please check your data again; the rows you are showing are the ones with matches.

You can also perform the join using the DataFrame API:

# Left outer join, explicit join type
df1.join(df2, df1["col1"] == df2["col1"], "left_outer")
answered Oct 08 '22 by Arvind Kumar


You are filtering out null values for p.created_year (and for p.uuid) with

where t.created_year = 2016  and p.created_year = 2016 

The way to avoid this is to move the filtering clause for p into the ON clause:

sqlContext.sql("""
    select t.type, t.uuid, p.uuid
    from symptom_type t
    LEFT JOIN plugin p
      ON t.uuid = p.uuid and p.created_year = 2016
    where t.created_year = 2016
""").show()
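The difference between filtering in WHERE versus in ON can be shown with a pure-Python sketch (the uuids and years are invented):

```python
# Invented rows: uuid -> created_year on each side.
t = [("u1", 2016), ("u2", 2016)]
p = {"u1": 2015}  # u1 exists on the right, but with the wrong year; u2 is absent

# LEFT JOIN ... ON t.uuid = p.uuid, with no year condition yet:
raw = [(u, p.get(u)) for u, y in t if y == 2016]
# raw == [('u1', 2015), ('u2', None)]

# WHERE p.created_year = 2016 applied AFTER the join: null years fail
# the comparison, so both rows vanish -- the outer join silently
# behaves like an inner join.
where_filtered = [(u, py) for u, py in raw if py == 2016]
# where_filtered == []

# The same condition inside ON: a right row only counts as a match if
# its year is 2016; unmatched left rows are kept and padded with None.
on_filtered = [(u, u if p.get(u) == 2016 else None) for u, y in t if y == 2016]
# on_filtered == [('u1', None), ('u2', None)]
```

This is exactly why the question's query returned only matched rows: the WHERE clause on p.created_year discarded every null-padded row the left join had produced.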

This is correct but can be inefficient, because we would also like to filter on t.created_year before the join happens. So it is recommended to use subqueries:

sqlContext.sql("""
    select t.type, t.uuid, p.uuid
    from (
      SELECT type, uuid FROM symptom_type WHERE created_year = 2016
    ) t
    LEFT JOIN (
      SELECT uuid FROM plugin WHERE created_year = 2016
    ) p ON t.uuid = p.uuid
""").show()
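The subquery shape is "filter each side first, then join", which a small pure-Python sketch captures (rows are invented):

```python
# Invented rows: (uuid, created_year) for each table.
symptom_type = [("u1", 2016), ("u2", 2016), ("u3", 2015)]
plugin = [("u1", 2016), ("u4", 2016), ("u2", 2015)]

# Subquery equivalent: prune each side to 2016 before joining...
t = [u for u, y in symptom_type if y == 2016]   # ['u1', 'u2']
p = {u for u, y in plugin if y == 2016}         # {'u1', 'u4'}

# ...then left-join only the reduced inputs.
result = [(u, u if u in p else None) for u in t]

print(result)
# [('u1', 'u1'), ('u2', None)]
```

Filtering before the join shrinks the data that has to be shuffled, while the left-outer semantics (u2 kept with None) are preserved.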
answered Oct 08 '22 by Ivan Medvedev