
Using when and otherwise while converting boolean values to strings in PySpark

I have a data frame in PySpark:

df.show()


+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  true|
|  2| Ram|      Y|      0.05|   10| false|
|  3| Ian|      N|      0.01|    1| false|
|  4| Jim|      N|       1.2|    3|  true|
+---+----+-------+----------+-----+------+

The schema is below:

DataFrame[id: int, name: string, testing: string, avg_result: string, score: string, active: boolean]

I want to convert Y to True, N to False, true to True, and false to False.

When I try the following:

for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').
                       when(f.col(col) == 'true', True).when(f.col(col) == 'false', False).otherwise(f.col(col)))

I get the error below and there is no change in the data frame:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN (testing = N) THEN False WHEN (testing = Y) THEN True WHEN (testing = true) THEN true WHEN (testing = false) THEN false ELSE testing' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"

+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  true|
|  2| Ram|      Y|      0.05|   10| false|
|  3| Ian|      N|      0.01|    1| false|
|  4| Jim|      N|       1.2|    3|  true|
+---+----+-------+----------+-----+------+

When I try the following:

for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').otherwise(f.col(col)))

I get the error below:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"

But the data frame changes to:

+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  true|
|  2| Ram|   True|      0.05|   10| false|
|  3| Ian|  False|      0.01|    1| false|
|  4| Jim|  False|       1.2|    3|  true|
+---+----+-------+----------+-----+------+

New attempt

for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').
                       when(f.col(col) == 'true', 'True').when(f.col(col) == 'false', 'False').otherwise(f.col(col)))

Error received

pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(true as double)))) null else CASE cast(cast(true as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(false as double)))) null else CASE cast(cast(false as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"

How can I get the data frame to look like the following?

+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  True|
|  2| Ram|   True|      0.05|   10| False|
|  3| Ian|  False|      0.01|    1| False|
|  4| Jim|  False|       1.2|    3|  True|
+---+----+-------+----------+-----+------+
asked Jul 02 '18 by User12345


People also ask

How do you use when and otherwise in PySpark?

PySpark's when() is a SQL function that returns a Column; to use it, first import it from pyspark.sql.functions. otherwise() is a method of Column, and if otherwise() is not used and none of the conditions are met, the result is None (null). Usage looks like when(condition, value).otherwise(default).
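For illustration, a minimal self-contained sketch (the flag column and its values are made up for this example, not from the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("Y",), ("N",), (None,)], ["flag"])

# Rows matching no condition would become null if otherwise() were omitted
sdf.select(
    f.when(f.col("flag") == "Y", "True")
     .when(f.col("flag") == "N", "False")
     .otherwise(f.col("flag"))
     .alias("flag_str")
).show()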

How do you convert string to boolean in PySpark?

In PySpark, you can cast or change a DataFrame column's data type using the cast() function of the Column class, for example via withColumn(), selectExpr(), or a SQL expression, to cast from String to Int (IntegerType), String to Boolean, etc.
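A minimal sketch of both styles, using a hypothetical active_str column:

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("true",), ("false",)], ["active_str"])

# Cast via withColumn() ...
sdf.withColumn("active_bool", sdf["active_str"].cast(BooleanType())).show()
# ... or via selectExpr() with a SQL CAST
sdf.selectExpr("CAST(active_str AS boolean) AS active_bool").show()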

How do you cast to string in PySpark?

To typecast an integer to a string in PySpark, use the cast() function with StringType() as the argument; to typecast a string to an integer, use cast() with IntegerType() as the argument.
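A short sketch of both directions on a toy DataFrame (column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "10"), (2, "20")], ["num", "num_str"])

# int -> string and string -> int
sdf = sdf.withColumn("num_as_str", f.col("num").cast(StringType())) \
         .withColumn("num_str_as_int", f.col("num_str").cast(IntegerType()))
sdf.printSchema()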

How do you convert a column to a string in PySpark?

To convert an array to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and an array column (type Column) as the second. To use concat_ws(), import it from pyspark.sql.functions.
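A small sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(["a", "b", "c"],)], ["letters"])

# Join the array elements with a comma delimiter
sdf.select(concat_ws(",", "letters").alias("letters_str")).show()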


2 Answers

As I mentioned in the comments, the issue is a type mismatch. You need to convert the boolean column to a string before doing the comparison. Finally, you need to cast the column to a string in the otherwise() as well (you can't have mixed types in a column).

Your code is easy to modify to get the correct output:

import pyspark.sql.functions as f

cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(
        col, 
        f.when(
            f.col(col) == 'N',
            'False'
        ).when(
            f.col(col) == 'Y',
            'True'
        ).when(
            f.col(col).cast('string') == 'true',
            'True'
        ).when(
            f.col(col).cast('string') == 'false',
            'False'
        ).otherwise(f.col(col).cast('string'))
    )
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#|  1| sam|   null|      null| null|  True|
#|  2| Ram|   True|      0.05|   10| False|
#|  3| Ian|  False|      0.01|    1| False|
#|  4| Jim|  False|       1.2|    3|  True|
#+---+----+-------+----------+-----+------+
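
As a quick optional sanity check, df.dtypes should now report string for both transformed columns:

print(df.dtypes)
#[('id', 'int'), ('name', 'string'), ('testing', 'string'),
# ('avg_result', 'string'), ('score', 'string'), ('active', 'string')]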

However, there are some alternative approaches as well. For instance, this is a good place to use pyspark.sql.Column.isin():

from functools import reduce  # reduce is a builtin in Python 2 but must be imported in Python 3

df = reduce(
    lambda df, col: df.withColumn(
        col, 
        f.when(
            f.col(col).cast('string').isin(['N', 'false']),
            'False'
        ).when(
            f.col(col).cast('string').isin(['Y', 'true']),
            'True'
        ).otherwise(f.col(col).cast('string'))
    ),
    cols,
    df
)
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#|  1| sam|   null|      null| null|  True|
#|  2| Ram|   True|      0.05|   10| False|
#|  3| Ian|  False|      0.01|    1| False|
#|  4| Jim|  False|       1.2|    3|  True|
#+---+----+-------+----------+-----+------+

(Here I used reduce to eliminate the for loop, but you could have kept it.)
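
Kept, the same isin() logic with the for loop would look like this:

for col in cols:
    df = df.withColumn(
        col,
        f.when(f.col(col).cast('string').isin(['N', 'false']), 'False')
        .when(f.col(col).cast('string').isin(['Y', 'true']), 'True')
        .otherwise(f.col(col).cast('string'))
    )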

You could also use pyspark.sql.DataFrame.replace() but you'd have to first convert the column active to a string:

df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true'], 'True', subset=cols)\
    .replace(['N', 'false'], 'False', subset=cols)
df.show()
# results omitted, but it's the same as above

Or using replace just once:

df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
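
replace() also accepts a dict that maps old values to new ones, which some may find more readable (same result either way):

df = df.withColumn('active', f.col('active').cast('string'))\
    .replace({'Y': 'True', 'true': 'True', 'N': 'False', 'false': 'False'}, subset=cols)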
answered Oct 14 '22 by pault


Looking at the schema and the transformations applied, there is a type mismatch between the String and Boolean values returned, e.g. 'N' is returned as 'False' (a string) while 'false' is returned as False (a boolean).

You can cast the transformed columns to String to convert Y to True, N to False, true to True and false to False.

from pyspark.sql.types import *
from pyspark.sql import functions as f

data = [
  (1, "sam", None, None, None, True),
  (2, "Ram", "Y", "0.05", "10", False),
  (3, "Ian", "N", "0.01", "1", False),
  (4, "Jim", "N", "1.2", "3", True)
  ]

schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("name", StringType(), True),
  StructField("testing", StringType(), True),
  StructField("avg_result", StringType(), True),
  StructField("score", StringType(), True),
  StructField("active", BooleanType(), True)
  ])

df = sc.parallelize(data).toDF(schema)
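
(If you have a SparkSession, conventionally named spark, instead of a SparkContext, the same DataFrame can be built with createDataFrame:)

df = spark.createDataFrame(data, schema)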

Before applying the transformations

>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- testing: string (nullable = true)
|-- avg_result: string (nullable = true)
|-- score: string (nullable = true)
|-- active: boolean (nullable = true)

>>> df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  true|
|  2| Ram|      Y|      0.05|   10| false|
|  3| Ian|      N|      0.01|    1| false|
|  4| Jim|      N|       1.2|    3|  true|
+---+----+-------+----------+-----+------+

Applying the transformation with a cast in the otherwise clause, .otherwise(f.col(col).cast("string")):

cols = ["testing", "active"]

for col in cols:
    df = df.withColumn(col,
      f.when(f.col(col) == 'N', 'False')
      .when(f.col(col) == 'Y', 'True')
      .when(f.col(col).cast("string") == 'true', 'True')
      .when(f.col(col).cast("string") == 'false', 'False')
      .otherwise(f.col(col).cast("string")))

Results

>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- testing: string (nullable = true)
|-- avg_result: string (nullable = true)
|-- score: string (nullable = true)
|-- active: string (nullable = true)

>>> df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
|  1| sam|   null|      null| null|  True|
|  2| Ram|   True|      0.05|   10| False|
|  3| Ian|  False|      0.01|    1| False|
|  4| Jim|  False|       1.2|    3|  True|
+---+----+-------+----------+-----+------+
answered Oct 14 '22 by raj