df1 has fields id and json; df2 has fields id and json.

df1.count() => 1200; df2.count() => 20

df1 has all the rows. df2 has an incremental update with just 20 rows. My goal is to update df1 with the values from df2. All the ids of df2 are in df1, but df2 has updated values (in the json field) for those same ids. The resulting df should have all the values from df1 and the updated values from df2.
What is the best way to do this, with the least number of joins and filters?
Thanks!
You can achieve this using one left join.
Create Example DataFrames
Using the sample data provided by @Shankar Koirala in his answer.
# df1 stands in for the full table; df2 is the incremental update
# (ids 1 and 2 receive new values)
data1 = [
    (1, "a"),
    (2, "b"),
    (3, "c")
]
df1 = sqlCtx.createDataFrame(data1, ["id", "value"])

data2 = [
    (1, "x"),
    (2, "y")
]
df2 = sqlCtx.createDataFrame(data2, ["id", "value"])
Do a left join
Join the two DataFrames using a left join on the id column. This will keep all of the rows in the left DataFrame. For the rows in the right DataFrame that don't have a matching id, the value will be null.
import pyspark.sql.functions as f

df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select(
        'id',
        f.col('l.value').alias('left_value'),
        f.col('r.value').alias('right_value')
    )\
    .show()
#+---+----------+-----------+
#| id|left_value|right_value|
#+---+----------+-----------+
#| 1| a| x|
#| 3| c| null|
#| 2| b| y|
#+---+----------+-----------+
Select the desired data
We will use the fact that the unmatched ids have a null to select the final columns. Use pyspark.sql.functions.when() to take the right value if it is not null, and otherwise keep the left value.
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select(
        'id',
        f.when(
            ~f.isnull(f.col('r.value')),
            f.col('r.value')
        ).otherwise(f.col('l.value')).alias('value')
    )\
    .show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+
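An equivalent, slightly more compact way to write the same selection is pyspark.sql.functions.coalesce(), which returns its first non-null argument. This is a sketch using the same df1 and df2 as above:

# coalesce() picks r.value when the id was updated, and falls back
# to l.value for the ids that df2 does not contain
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select('id', f.coalesce(f.col('r.value'), f.col('l.value')).alias('value'))\
    .show()

This produces the same output as the when()/otherwise() version.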
You can sort this output if you want the ids in order, as shown below.
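For example, append orderBy('id') to the chain before show():

# Sort the result by id before displaying it
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select('id', f.coalesce(f.col('r.value'), f.col('l.value')).alias('value'))\
    .orderBy('id')\
    .show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    x|
#|  2|    y|
#|  3|    c|
#+---+-----+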
Using pyspark-sql
You can do the same thing using a pyspark-sql query:
df1.registerTempTable('df1')
df2.registerTempTable('df2')

query = """SELECT l.id,
       CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value
FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
sqlCtx.sql(query).show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+
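Note that registerTempTable() is deprecated since Spark 2.0. On newer versions, the equivalent calls (assuming a SparkSession named spark) are:

# Spark 2.0+: register temp views on the DataFrames and run SQL via the SparkSession
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
spark.sql(query).show()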