
Syntax while setting schema for Pyspark.sql using StructType

I am new to Spark and was playing around with pyspark.sql. According to the pyspark.sql documentation, one can set up a Spark DataFrame and its schema like this:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, TimestampType, StructType, StructField

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile('./some csv_to_play_around.csv')

schema = StructType([StructField('Name', StringType(), True),
                     StructField('DateTime', TimestampType(), True),
                     StructField('Age', IntegerType(), True)])

# create dataframe
df3 = spark.createDataFrame(rdd, schema)

My question is: what does the True stand for in the schema above? I can't seem to find it in the documentation. Thanks in advance.

Jason asked May 13 '15 12:05


People also ask

How do I add a schema to a Pyspark Dataframe?

Creating a new schema: PySpark stores a DataFrame's schema as a StructType object. The add() method on a StructType appends a new field (column) to the schema. add() takes up to 4 parameters, of which the last 3 are optional.

What is StructType in Pyspark?

StructType defines the structure of a DataFrame. PySpark provides the StructType class (from pyspark.sql.types import StructType) for this purpose. A StructType is a collection (list) of StructField objects. The printSchema() method on a DataFrame shows StructType columns as struct.


2 Answers

It indicates whether the column allows null values: True means nullable, False means not nullable.

StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable indicates whether values of this field can be null.

Refer to the Spark SQL and DataFrame Guide for more information.

yjshen answered Nov 01 '22 20:11


You can also use a datatype string:

schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER' 

There isn't much documentation on datatype strings, but the docs do mention them. They're much more compact and readable than StructTypes.

pcv answered Nov 01 '22 21:11