I have a StructField in a dataframe that is not nullable. Simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
which returns:
[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]
Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted the code from it to this:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)
which failed with:
TypeError: StructField(name,StringType,true) is not JSON serializable
I also see this in the stack trace:
raise ValueError("Circular reference detected")
So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?
I know this question is already answered, but I was looking for a more generic solution when I came up with this:
def set_df_columns_nullable(spark, df, column_list, nullable=True):
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod
You can then call it like this:
set_df_columns_nullable(spark,df,['name','age'])
For the general case, one can change the nullability of a column via the nullable property of the StructField of that specific column.
Here's an example:
df.schema['col_1']
# StructField(col_1,DoubleType,false)
df.schema['col_1'].nullable = True
df.schema['col_1']
# StructField(col_1,DoubleType,true)
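It seems you missed wrapping the list of fields in a StructType, i.e. StructType(newSchema). Also, since you want foo to be nullable, its StructField should be declared with nullable=True.
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
# foo is declared nullable here (third argument True)
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),True)]
# wrap the list of fields in a StructType before passing it as the schema
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))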
df2.show()