I have a data frame in PySpark like below:
df.show()
+-----------+------------+-------------+
|customer_id|product_name| country|
+-----------+------------+-------------+
| 12870946| null| Poland|
| 815518| MA401|United States|
| 3138420| WG111v2| UK|
| 3178864| WGR614v6|United States|
| 7456796| XE102|United States|
| 21893468| AGM731F|United States|
+-----------+------------+-------------+
I have another data frame like below:
df1.show()
+-----------+------------+
|customer_id|product_name|
+-----------+------------+
| 12870946| GS748TS|
| 815518| MA402|
| 3138420| null|
| 3178864| WGR614v6|
| 7456796| XE102|
| 21893468| AGM731F|
| null| AE171|
+-----------+------------+
Now I want to do a full outer join on these data frames and update the product_name column values like below.
1) Overwrite the values in `df` using the values in `df1` if there are values in `df1`.
2) If there are `null` values or no values in `df1`, leave the values in `df` as they are.
Expected result:
+-----------+------------+-------------+
|customer_id|product_name| country|
+-----------+------------+-------------+
| 12870946| GS748TS| Poland|
| 815518| MA402|United States|
| 3138420| WG111v2| UK|
| 3178864| WGR614v6|United States|
| 7456796| XE102|United States|
| 21893468| AGM731F|United States|
| null| AE171| null|
+-----------+------------+-------------+
I have tried the following:
import pyspark.sql.functions as f
df2 = df.join(df1, df.customer_id == df1.customer_id, 'full_outer')\
    .select(
        df.customer_id,
        f.coalesce(df.product_name, df1.product_name).alias('product_name'),
        df.country
    )
But the result I am getting is different:
df2.show()
+-----------+------------+-------------+
|customer_id|product_name| country|
+-----------+------------+-------------+
| 12870946| null| Poland|
| 815518| MA401|United States|
| 3138420| WG111v2| UK|
| 3178864| WGR614v6|United States|
| 7456796| XE102|United States|
| 21893468| AGM731F|United States|
| null| AE171| null|
+-----------+------------+-------------+
How can I get the expected result?
The code you wrote produces the correct output for me, so I cannot reproduce your problem. I have seen other posts where using an alias when doing joins has resolved issues, so here is a slightly modified version of your code which will do the same thing:
import pyspark.sql.functions as f
df.alias("r").join(df1.alias("l"), on="customer_id", how='full_outer')\
.select(
"customer_id",
f.coalesce("r.product_name", "l.product_name").alias('product_name'),
"country"
)\
.show()
#+-----------+------------+-------------+
#|customer_id|product_name| country|
#+-----------+------------+-------------+
#| 7456796| XE102|United States|
#| 3178864| WGR614v6|United States|
#| null| AE171| null|
#| 815518| MA401|United States|
#| 3138420| WG111v2| UK|
#| 12870946| GS748TS| Poland|
#| 21893468| AGM731F|United States|
#+-----------+------------+-------------+
I get the same results when I run your code as well (reproduced below):
df.join(df1, df.customer_id == df1.customer_id, 'full_outer')\
    .select(
        df.customer_id,
        f.coalesce(df.product_name, df1.product_name).alias('product_name'),
        df.country
    )\
    .show()
I am using Spark 2.1 and Python 2.7.13.
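One thing worth noting: both versions keep df's product_name when both sides are non-null, which is why customer 815518 shows MA401 here rather than the MA402 in your expected output. If df1's value should win whenever it is non-null (your rule 1), swapping the coalesce arguments should do it. A minimal sketch, assuming the same df and df1 and that the data contains true nulls:

import pyspark.sql.functions as f

# Sketch: list df1's column first in coalesce so its non-null values
# overwrite df's, and df's value is used only as the fallback (rule 2).
df.alias("a").join(df1.alias("b"), on="customer_id", how='full_outer')\
    .select(
        "customer_id",
        f.coalesce("b.product_name", "a.product_name").alias('product_name'),
        "country"
    )\
    .show()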
Your code is perfect if the values are true nulls and not the string "null". But looking at the df2 dataframe you are getting, the values in product_name seem to be the string "null". You will have to check for the string "null" using the when and isnull built-in functions, as follows:
import pyspark.sql.functions as f
df2 = df.join(df1, df.customer_id == df1.customer_id, 'full_outer')\
    .select(
        df.customer_id,
        f.when(f.isnull(df.product_name) | (df.product_name == "null"), df1.product_name)
         .otherwise(df.product_name)
         .alias('product_name'),
        df.country
    )
df2.show(truncate=False)
which should give you
+-----------+------------+-------------+
|customer_id|product_name|country      |
+-----------+------------+-------------+
|7456796    |XE102       |United States|
|3178864    |WGR614v6    |United States|
|815518     |MA401       |United States|
|3138420    |WG111v2     |UK           |
|12870946   |GS748TS     |Poland       |
|21893468   |AGM731F     |United States|
|null       |AE171       |null         |
+-----------+------------+-------------+
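Building on the same diagnosis, another option is to normalize the literal string "null" to a real NULL once up front, after which the plain coalesce approach from the question works unchanged. A minimal sketch, where df_clean is just a hypothetical intermediate name:

import pyspark.sql.functions as f

# Sketch: when() without otherwise() returns NULL whenever the condition is
# false or null, so both real nulls and the literal string "null" become NULL.
df_clean = df.withColumn(
    "product_name",
    f.when(f.col("product_name") != "null", f.col("product_name"))
)

df2 = df_clean.join(df1, df_clean.customer_id == df1.customer_id, 'full_outer')\
    .select(
        df_clean.customer_id,
        f.coalesce(df_clean.product_name, df1.product_name).alias('product_name'),
        df_clean.country
    )
df2.show(truncate=False)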