I have 500 columns in my pyspark data frame...Some are of string type,some int and some boolean(100 boolean columns ). Now, all the boolean columns have two distinct levels - Yes and No and I want to convert those into 1/0
For string I have three values- passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers
 c1| c2 |    c3 |c4|c5..... |c500
yes| yes|passed |45....
No | Yes|failed |452....
Yes|No  |None   |32............
when I do
df.replace(yes,1)
I get following error:
ValueError: Mixed type replacements are not supported
                For string I have three values- passed, failed and null. How do I replace those nulls with 0? fillna(0) works only with integers
First, import when and lit 
from pyspark.sql.functions import when, lit
Assuming your DataFrame has these columns
# Reconstructing my DataFrame based on your assumptions
# cols are Columns in the DataFrame
cols = ['name', 'age', 'col_with_string']
# Similarly the values
vals = [
     ('James', 18, 'passed'),
     ('Smith', 15, 'passed'),
     ('Albie', 32, 'failed'),
     ('Stacy', 33, None),
     ('Morgan', 11, None),
     ('Dwight', 12, None),
     ('Steve', 16, 'passed'), 
     ('Shroud', 22, 'passed'),
     ('Faze', 11,'failed'),
     ('Simple', 13, None)
]
# This will create a DataFrame using 'cols' and 'vals'
# spark is an object of SparkSession
df = spark.createDataFrame(vals, cols)
# We have the following DataFrame
df.show()
+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|           null|
|Morgan| 11|           null|
|Dwight| 12|           null|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|           null|
+------+---+---------------+
You can use:
I can replace the values having null with 0
df = df.withColumn('col_with_string', when(df.col_with_string.isNull(), 
lit('0')).otherwise(df.col_with_string))
# We have replaced nulls with a '0'
df.show()
+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|              0|
|Morgan| 11|              0|
|Dwight| 12|              0|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|              0|
+------+---+---------------+
Part 1 of your question: Yes/No boolean values - you mentioned that, there are 100 columns of Boolean's. For this, I generally reconstruct the table with updated values or create a UDF returns 1 or 0 for Yes or No.
I am adding two more columns can_vote and can_lotto to the DataFrame (df)
df = df.withColumn("can_vote", col('Age') >= 18)
df = df.withColumn("can_lotto", col('Age') > 16) 
# Updated DataFrame will be
df.show()
+------+---+---------------+--------+---------+
|  name|age|col_with_string|can_vote|can_lotto|
+------+---+---------------+--------+---------+
| James| 18|         passed|    true|     true|
| Smith| 15|         passed|   false|    false|
| Albie| 32|         failed|    true|     true|
| Stacy| 33|              0|    true|     true|
|Morgan| 11|              0|   false|    false|
|Dwight| 12|              0|   false|    false|
| Steve| 16|         passed|   false|    false|
|Shroud| 22|         passed|    true|     true|
|  Faze| 11|         failed|   false|    false|
|Simple| 13|              0|   false|    false|
+------+---+---------------+--------+---------+
Assuming you have similar columns to can_vote and can_lotto (boolean values being Yes/No)
You can use the following line of code to fetch the columns in the DataFrame having boolean type
col_with_bool = [item[0] for item in df.dtypes if item[1].startswith('boolean')]
This returns a list
['can_vote', 'can_lotto']
You can create a UDF and iterate for each column in this type of list, lit each of the columns using 1 (Yes) or 0 (No).
For reference, refer to the following links
I tried to replicate you issue with the below data:
df_test=pd.DataFrame([['yes','pass',1.2],['No','pass',34],['yes',None,0.4],[0,1,'No'],['No',1,True],['NO','YES',1]])
then I just use:
df_test.replace('yes',1)
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With