 

How to name file when saveAsTextFile in spark?

When saving an RDD as a text file in Spark 1.5.1, I use: rdd.saveAsTextFile('<directory>').

But if I want to find the file in that directory later, how do I give it the name I want?

Currently, I think it is named part-00000, which must be some default. How do I give it a name?

asked Nov 11 '15 by makansij

People also ask

How do I save an RDD file?

You can save an RDD using the saveAsObjectFile and saveAsTextFile methods, and read it back using the textFile and sequenceFile functions on SparkContext.
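For example, a minimal PySpark round trip (the /tmp path is just an illustration):

rdd = sc.parallelize(["line one", "line two"])
rdd.saveAsTextFile("/tmp/rdd_text")  # writes part-NNNNN files under /tmp/rdd_text

back = sc.textFile("/tmp/rdd_text")  # reads the directory back as an RDD of strings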

How do I write to HDFS in Spark?

To read a CSV or TSV file from HDFS, call spark.read.csv("path") with an HDFS path. To write a CSV file to HDFS, use the write() method of the Spark DataFrameWriter object.
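For example, a minimal sketch using the DataFrame API (the namenode host and paths are placeholders):

df = spark.read.csv("hdfs://namenode:8020/data/input.csv", header=True)
df.write.csv("hdfs://namenode:8020/data/output", header=True)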


2 Answers

The correct answer to this question is that saveAsTextFile does not allow you to name the actual file.

The reason is that the data is partitioned: Spark treats the path you pass to saveAsTextFile(...) as a directory and writes one part file per partition inside it.

You can call rdd.coalesce(1).saveAsTextFile('/some/path/somewhere') and it will create a single file /some/path/somewhere/part-00000, but you still cannot choose its name.

If you need more control than this, you will need to perform an actual file operation on your end after an rdd.collect().

Note that coalesce(1) funnels all of the data through a single executor, and rdd.collect() pulls it all to the driver, so you may run into memory issues either way. That's the risk you take.
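For example, a minimal sketch of that collect-and-write approach (the output path and filename are assumptions):

lines = rdd.collect()  # pulls the entire RDD to the driver
with open("/some/path/somewhere/my_output.txt", "w") as f:
    for line in lines:
        f.write(line + "\n")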

answered Oct 22 '22 by nod


As @nod said, it's not possible to name the file directly. However, you can rename the file right after it is written. An example using PySpark:

# use the classic FileOutputCommitter so output lands directly in the target path
sc._jsc.hadoopConfiguration().set(
    "mapred.output.committer.class",
    "org.apache.hadoop.mapred.FileOutputCommitter")

# access the Hadoop FileSystem API through the py4j JVM gateway
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("s3://{bucket_name}"), sc._jsc.hadoopConfiguration())

file_path = "s3://{bucket_name}/processed/source={source_name}/year={partition_year}/week={partition_week}/"
# remove data already stored, if necessary (True = recursive delete)
fs.delete(Path(file_path), True)

rdd.saveAsTextFile(
    file_path, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

# rename the created part file to the desired name
created_file_path = fs.globStatus(Path(file_path + "part*.gz"))[0].getPath()
fs.rename(
    created_file_path,
    Path(file_path + "{desired_name}.jl.gz"))
answered Oct 22 '22 by Juan Riaza