
How to get the schema definition from a dataframe in PySpark?

In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

Schema = StructType([ StructField("temperature", DoubleType(), True),
                      StructField("temperature_unit", StringType(), True),
                      StructField("humidity", DoubleType(), True),
                      StructField("humidity_unit", StringType(), True),
                      StructField("pressure", DoubleType(), True),
                      StructField("pressure_unit", StringType(), True)
                    ])

For some data sources it is possible to infer the schema from the data source and get a dataframe with this schema definition.

Is it possible to get the schema definition (in the form described above) from a dataframe whose schema has been inferred from the data?

df.printSchema() prints the schema as a tree, but I need to reuse the schema in the form defined above, so that I can read a data source with a schema that was previously inferred from another data source.

Hauke Mallow asked Feb 03 '19 12:02



People also ask

How do I print a schema of a DataFrame in Pyspark?

pyspark.sql.DataFrame.printSchema() prints the schema of the DataFrame in tree format, showing each column name and data type. If the DataFrame has a nested structure, the schema is displayed as a nested tree.

How do I get the schema of a column Pyspark?

You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes and df.schema, and you can retrieve the data type of a specific column using df.schema["name"].


2 Answers

Yes, it is possible. Use the DataFrame.schema property:

schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

New in version 1.3.

The schema can also be exported to JSON and imported back if needed.

user11008525 answered Oct 21 '22 03:10



The code below gives you the schema definition of the known dataframe as a list of StructField objects. This is quite useful when you have a very large number of columns and writing the definition by hand is cumbersome. You can then hand-edit any columns you want and apply the result to your new dataframe.

from pyspark.sql.types import StructType

# Iterating over a StructType yields its StructField objects.
schema = list(df.schema)

And then from here, you have your new schema:

NewSchema = StructType(schema)
Laenka-Oss answered Oct 21 '22 02:10
