I have a dataframe similar to the one below. I originally filled all null values with -1 to do my joins in PySpark.
import pandas as pd

# spark is an existing SparkSession
df = pd.DataFrame({'Number': ['1', '2', '-1', '-1'],
                   'Letter': ['A', '-1', 'B', 'A'],
                   'Value': [30, 30, 30, -1]})
pyspark_df = spark.createDataFrame(df)
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| -1| 30|
| -1| B| 30|
| -1| A| -1|
+------+------+-----+
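For reference, the fill step was roughly the following (a sketch: na.fill only fills columns whose type matches the value, so numeric and string columns are filled separately; raw_df stands in for the frame that still had real nulls):

# Sketch: fill numeric columns with -1 and string columns with '-1'
filled_df = raw_df.na.fill(-1).na.fill('-1')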
After processing the dataset, I need to replace all of the -1 values back with nulls:
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
What's the easiest way to do this?
A less verbose way to do this is to use replace, once for the numeric -1 and once for the string '-1':

pyspark_df.replace(-1, None).replace('-1', None).show()
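If the sentinel only lives in some columns, replace also takes a subset argument. A sketch, assuming Number and Letter hold the string sentinel and Value the numeric one:

pyspark_df.replace('-1', None, subset=['Number', 'Letter']) \
          .replace(-1, None, subset=['Value']) \
          .show()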
when + otherwise shall do the trick:

import pyspark.sql.functions as F

# Non-numeric strings cast to null, so the condition is only true for negative numbers
pyspark_df.select([F.when(F.col(i).cast("integer") < 0, None).otherwise(F.col(i)).alias(i)
                   for i in pyspark_df.columns]).show()
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
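Note that cast("integer") < 0 nulls out any negative value, not just the -1 sentinel. If -1 should be matched exactly, a stricter sketch compares the string form instead:

import pyspark.sql.functions as F

# Only the exact '-1' sentinel is nulled; other negative values survive
pyspark_df.select([F.when(F.col(i).cast("string") == "-1", None).otherwise(F.col(i)).alias(i)
                   for i in pyspark_df.columns]).show()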
You can scan all columns and replace the -1's with None:
import pyspark.sql.functions as F

# The == comparison casts string columns implicitly, so '-1' matches -1 as well
for x in pyspark_df.columns:
    pyspark_df = pyspark_df.withColumn(x, F.when(F.col(x) == -1, F.lit(None)).otherwise(F.col(x)))
pyspark_df.show()
Output:
+------+------+-----+
|Number|Letter|Value|
+------+------+-----+
| 1| A| 30|
| 2| null| 30|
| null| B| 30|
| null| A| null|
+------+------+-----+
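One caveat with the loop: each withColumn call adds another projection to the query plan, which can get slow on wide dataframes. The same logic can be expressed as a single select (a sketch equivalent to the loop above):

import pyspark.sql.functions as F

# One projection instead of one withColumn per column
pyspark_df = pyspark_df.select([F.when(F.col(x) == -1, F.lit(None)).otherwise(F.col(x)).alias(x)
                                for x in pyspark_df.columns])
pyspark_df.show()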