I would like to do some cleanup at the start of my Spark (PySpark) program. For example, I would like to delete data in HDFS left over from a previous run. In Pig this can be done using commands such as
fs -copyFromLocal ....
rmf /path/to-/hdfs
or locally using the sh command.
I was wondering how to do the same with Pyspark.
If using external libraries is not an issue, another way to interact with HDFS from PySpark is to use a plain Python library directly, for example the hdfs package or Spotify's snakebite:
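A minimal sketch with the hdfs package (a WebHDFS client) might look like the following; the namenode URL, port and user are placeholders for your cluster:

from hdfs import InsecureClient

# connect to the namenode's WebHDFS endpoint (URL, port and user are placeholders)
client = InsecureClient('http://namenode:9870', user='hadoop')

# recursively remove the output directory left by a previous run
client.delete('/path/to/hdfs', recursive=True)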
You can delete an HDFS path in PySpark without using third-party dependencies as follows:
from pyspark.sql import SparkSession
# example of preparing a spark session
spark = SparkSession.builder.appName('abc').getOrCreate()
sc = spark.sparkContext
# Prepare a FileSystem manager
fs = (sc._jvm.org
      .apache.hadoop
      .fs.FileSystem
      .get(sc._jsc.hadoopConfiguration())
      )
path = "Your/hdfs/path"
# use the FileSystem manager to remove the path
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
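The second argument to delete enables recursive deletion. If the path may not exist yet (for example on the very first run), you can guard the call with an existence check, along these lines:

Path = sc._jvm.org.apache.hadoop.fs.Path
# only attempt the delete when the path is actually there
if fs.exists(Path(path)):
    fs.delete(Path(path), True)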
To take this one step further, you can wrap the above into a helper function that you can re-use across jobs/packages:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
def delete_path(spark, path):
    sc = spark.sparkContext
    # get a handle on the Hadoop FileSystem through the JVM gateway
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    # recursively delete the given path
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
delete_path(spark, "Your/hdfs/path")
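The same JVM gateway also covers the copyFromLocal part of the question: Hadoop's FileSystem exposes copyFromLocalFile(src, dst). A sketch along the same lines, with the local and HDFS paths as placeholders, would be:

sc = spark.sparkContext
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = (sc._jvm.org.apache.hadoop.fs.FileSystem
      .get(sc._jsc.hadoopConfiguration()))

# copy a local file into HDFS, analogous to `fs -copyFromLocal` in Pig
fs.copyFromLocalFile(Path("file:///local/data.csv"), Path("Your/hdfs/path/data.csv"))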