How to save result of printSchema to a file in PySpark

I used df.printSchema() in PySpark, and it prints the schema as a tree. Now I need to save that output in a variable or a text file.

I tried the approaches below, but neither worked.

v = str(df.printSchema())  
print(v) 
#and
df.printSchema().saveAsTextFile(<path>)

I need the saved schema in the format below:

 |-- COVERSHEET: struct (nullable = true)
 |    |-- ADDRESSES: struct (nullable = true)
 |    |    |-- ADDRESS: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _city: string (nullable = true)
 |    |    |    |-- _primary: long (nullable = true)
 |    |    |    |-- _state: string (nullable = true)
 |    |    |    |-- _street: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- _zip: long (nullable = true)
 |    |-- CONTACTS: struct (nullable = true)
 |    |    |-- CONTACT: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |-- _name: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)

asked Jun 12 '18 12:06

Ahito

2 Answers

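printSchema() only prints the tree to standard output and returns None, which is why str(df.printSchema()) gives "None" and chaining saveAsTextFile onto it fails. You need treeString, which for some reason I couldn't find in the Python API, so call it on the underlying Java DataFrame: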

# v will be a string
v = df._jdf.schema().treeString()
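If you would rather not touch the private _jdf attribute, one possible alternative is to capture what printSchema() writes to standard output. This is only a sketch, assuming printSchema() prints through Python's own print (as it does in the PySpark versions I am aware of). capture_printed is a hypothetical helper name, and fake_print_schema merely stands in for df.printSchema so the snippet runs without Spark.

```python
import io
from contextlib import redirect_stdout


def capture_printed(func):
    """Call func() and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def fake_print_schema():
    # Stand-in for df.printSchema, used only so this sketch runs without Spark.
    print("root")
    print(" |-- COVERSHEET: struct (nullable = true)")


# With a real DataFrame this would be: v = capture_printed(df.printSchema)
v = capture_printed(fake_print_schema)
```

Either way, v ends up as an ordinary Python string holding the tree.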

You can convert the string v to an RDD and use saveAsTextFile:

sc.parallelize([v]).saveAsTextFile(...)

Or use Python's own file API to write the string to a file.
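A minimal sketch of the plain-Python route, assuming a local file on the driver machine is enough. save_schema_text and out_path are hypothetical names chosen for illustration, and the sample string stands in for the value returned by treeString() so the snippet runs without Spark.

```python
import os
import tempfile


def save_schema_text(tree_string, path):
    """Write the schema tree string to a local text file (UTF-8)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(tree_string)


# Sample value standing in for v = df._jdf.schema().treeString()
v = "root\n |-- COVERSHEET: struct (nullable = true)\n"

# Hypothetical output location; replace with your own path.
out_path = os.path.join(tempfile.gettempdir(), "schema.txt")
save_schema_text(v, out_path)
```

Unlike saveAsTextFile, this writes a single file on the machine running the driver rather than a directory of part files.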

answered Nov 15 '22 14:11

philantrovert

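You can also use the following, which stores the schema as a pickle file rather than as the printed tree. Here schema is presumably df.schema: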

temp_rdd = sc.parallelize(schema)
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")

answered Nov 15 '22 13:11

Rene B.

Tags:

python

apache-spark

pyspark
