I am new to Spark and was playing around with pyspark.sql. According to the pyspark.sql documentation here, one can set up a Spark DataFrame and schema like this:
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, IntegerType, TimestampType, StructType, StructField

    spark = SparkSession.builder.getOrCreate()
    rdd = sc.textFile('./some csv_to_play_around.csv')
    schema = StructType([StructField('Name', StringType(), True),
                         StructField('DateTime', TimestampType(), True),
                         StructField('Age', IntegerType(), True)])

    # create dataframe
    df3 = sqlContext.createDataFrame(rdd, schema)
My question is: what does the True stand for in the schema list above? I can't seem to find it in the documentation. Thanks in advance.
➠ Creating a new schema: PySpark stores a DataFrame's schema as a StructType object. The add() function on a StructType variable can be used to append new fields/columns, producing a new schema. add() takes up to 4 parameters; the last 3 are optional.
StructType – defines the structure of the DataFrame. PySpark provides the StructType class (from pyspark.sql.types import StructType) for this purpose. A StructType is a collection (list) of StructField objects. Calling printSchema() on a DataFrame shows StructType columns as struct.
It means whether the column allows null values: True for nullable, False for not nullable.
StructField(name, dataType, nullable): represents a field in a StructType. The name of the field is given by name, its data type by dataType, and nullable indicates whether values of this field can be null.
Refer to the Spark SQL and DataFrame Guide for more information.
You can also use a datatype string:
schema = 'Name STRING, DateTime TIMESTAMP, Age INTEGER'
There's not much documentation on datatype strings, but they are mentioned in the docs. They're much more compact and readable than StructTypes.