
How to save result of printSchema to a file in PySpark

I used df.printSchema() in PySpark, and it prints the schema as a tree structure. Now I need to save that output in a variable or a text file.

I tried the methods below, but neither worked.

v = str(df.printSchema())
print(v)
# and
df.printSchema().saveAsTextFile(<path>)

I need the saved schema in the following format:

 |-- COVERSHEET: struct (nullable = true)
 |    |-- ADDRESSES: struct (nullable = true)
 |    |    |-- ADDRESS: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _city: string (nullable = true)
 |    |    |    |-- _primary: long (nullable = true)
 |    |    |    |-- _state: string (nullable = true)
 |    |    |    |-- _street: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- _zip: long (nullable = true)
 |    |-- CONTACTS: struct (nullable = true)
 |    |    |-- CONTACT: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |-- _name: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
Ahito asked Jun 12 '18 12:06



2 Answers

You need treeString, which I could not find in the Python API, so call it on the underlying Java DataFrame instead:

# v will be a string; _jdf is the internal Java DataFrame handle
v = df._jdf.schema().treeString()

You can convert it to an RDD and use saveAsTextFile:

sc.parallelize([v]).saveAsTextFile(...)

Or use Python's own file API to write the string to a file.
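As a rough sketch of that second option, the string can be written with the standard library alone. The helper name save_schema_text, the temporary-directory path, and the sample_tree stand-in are illustrative assumptions, not part of the original answer; with Spark available you would pass v from df._jdf.schema().treeString() instead.

```python
import tempfile
from pathlib import Path


def save_schema_text(tree: str, path: str) -> Path:
    """Write a schema tree string to a local text file and return its path.

    Hypothetical helper for illustration only.
    """
    target = Path(path)
    # Make sure the file ends with a newline, adding one only if it is missing
    if not tree.endswith("\n"):
        tree += "\n"
    target.write_text(tree, encoding="utf-8")
    return target


# Stand-in for: v = df._jdf.schema().treeString()
# (used so the sketch runs without a Spark session)
sample_tree = "root\n |-- COVERSHEET: struct (nullable = true)\n"

# Hypothetical destination; any writable local path works
destination = Path(tempfile.gettempdir()) / "schema.txt"
saved = save_schema_text(sample_tree, str(destination))
print(saved.read_text(encoding="utf-8"))
```

This writes on the driver's local filesystem only, which is usually enough for a schema dump; the RDD route above is the one to use when the destination is a distributed store.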

philantrovert answered Nov 15 '22 14:11


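You can also save the schema as a pickle file. Note that this stores the schema's fields as Python objects rather than the tree-formatted text:

# schema is not defined in the original answer; assumed here to be df.schema
schema = df.schema
temp_rdd = sc.parallelize(schema)
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")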
Rene B. answered Nov 15 '22 13:11