I am new to Spark and was playing around with pyspark.sql. According to the pyspark.sql documentation here, one can set up a Spark DataFrame and schema like this:
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, IntegerType, TimestampType, StructType, StructField

    spark = SparkSession.builder.getOrCreate()
    rdd = sc.textFile('./some csv_to_play_around.csv')
    schema = StructType([StructField('Name', StringType(), True),
                         StructField('DateTime', TimestampType(), True),
                         StructField('Age', IntegerType(), True)])

    # create dataframe
    df3 = sqlContext.createDataFrame(rdd, schema)
My question is: what does the True stand for in the schema list above? I can't seem to find it in the documentation. Thanks in advance.
➠ Creating a new schema: PySpark stores a DataFrame's schema as a StructType object. The add() function on a StructType variable can be used to append new fields/columns, producing a new schema. add() takes up to 4 parameters; the last 3 are optional.
StructType – defines the structure of the DataFrame. PySpark provides the StructType class (from pyspark.sql.types import StructType) for this purpose. A StructType is a collection (list) of StructField objects. Calling printSchema() on a DataFrame shows StructType columns as struct.
It means whether the column allows null values: True for nullable, False for not nullable.
StructField(name, dataType, nullable): represents a field in a StructType. The name of the field is given by name, its data type by dataType, and nullable indicates whether values of this field can be null.
Refer to the Spark SQL and DataFrame Guide for more information.
You can also use a datatype string:
schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'
There's not much documentation on datatype strings, but they are mentioned in the docs. They're much more compact and readable than StructTypes.