Does Spark lock the File while writing to HDFS or S3

Tags:

apache-spark-sql

I have an S3 location with the below directory structure with a Hive table created on top of it:

s3://<Mybucket>/<Table Name>/<day Partition>

Let's say I have a Spark program that writes data into the above table location, spanning multiple partitions, using the line of code below:

Df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")

If another program, such as a Hive SQL query or an AWS Athena query, starts reading data from the table at the same time:

Do they consider temporary files being written?

Does Spark lock the data file while writing into the S3 location?

How can we handle such concurrency situations using Spark as an ETL tool?


asked Mar 19 '18 22:03

1 Answer

Spark writes its output in two steps. First, it writes the data to a _temporary directory under the output path; then, once the write has completed successfully, it moves the files into the output directory.

Do they consider temporary files being written?

Names that start with _ are treated as hidden, so Hive and AWS Athena skip those files and do not read them.
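To make the idea concrete, here is a minimal sketch on a local filesystem. It is not Spark's actual committer code; `visible_files`, `two_step_write`, the `orders` directory, and the file name are hypothetical names chosen for illustration. It only shows that a reader which skips names starting with _ or . sees nothing until the file has been moved into the partition directory.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def visible_files(table_dir: Path) -> List[str]:
    """List files the way a hidden-file-aware reader would:
    skip any path with a component starting with '_' or '.'."""
    found = []
    for path in table_dir.rglob("*"):
        rel = path.relative_to(table_dir)
        hidden = any(part.startswith(("_", ".")) for part in rel.parts)
        if path.is_file() and not hidden:
            found.append(rel.as_posix())
    return sorted(found)


def two_step_write(table_dir: Path, partition: str, name: str, data: bytes) -> List[str]:
    """Write under _temporary first, then move the file into the partition.
    Returns what a reader would have seen just before the move."""
    staging = table_dir / "_temporary" / partition
    staging.mkdir(parents=True, exist_ok=True)
    tmp_file = staging / name
    tmp_file.write_bytes(data)

    seen_before_commit = visible_files(table_dir)

    final_dir = table_dir / partition
    final_dir.mkdir(parents=True, exist_ok=True)
    # Atomic on a local POSIX filesystem; S3 has no true rename.
    os.replace(tmp_file, final_dir / name)
    return seen_before_commit


with tempfile.TemporaryDirectory() as root:
    table = Path(root) / "orders"  # hypothetical table directory
    table.mkdir()
    before = two_step_write(table, "orderdate=2018-03-19", "part-00000.parquet", b"rows")
    after = visible_files(table)
    print(before)  # []
    print(after)   # ['orderdate=2018-03-19/part-00000.parquet']
```

The rename here is a local-filesystem operation. On S3 the equivalent step is a copy followed by a delete, so the sketch illustrates the visibility rule, not S3's timing.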

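Does Spark lock the data file while writing into the S3 location?

No. Spark does not lock files, and with the two-step write no lock is needed to hide in-progress files: a file only shows up in the output directory once it has been fully written. Keep in mind that on S3 the "move" is really a copy followed by a delete, and a job's files are moved one at a time, so a concurrent reader can still see a partition that contains only some of the new files.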

How can we handle such concurrency situations using Spark as an ETL tool?

Again, by relying on the same mechanism: write to a temporary location first, then move the finished files into place.

One more thing to note: in your example, after writing the output to the output directory you need to add the new partitions to the Hive external table, using either the ALTER TABLE <tbl_name> ADD PARTITION (...) command or the MSCK REPAIR TABLE <tbl_name> command; otherwise the data won't be available in Hive.
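As a rough sketch of that last step, the helper below builds the partition statements as plain strings. The function names, the `orders` table name, and the dates are assumptions made up for the example, and the values are not escaped, so treat it as an illustration only. The resulting strings could then be run in Hive or Athena, or passed to `spark.sql(...)` on a Hive-enabled SparkSession.

```python
from typing import Dict


def add_partition_sql(table: str, spec: Dict[str, str], location: str = "") -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for one partition.
    Values are inserted as-is (no escaping), so only use trusted input."""
    cols = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    stmt = f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({cols})"
    if location:
        stmt += f" LOCATION '{location}'"
    return stmt


def repair_sql(table: str) -> str:
    """Build the statement that rescans the table location for partitions."""
    return f"MSCK REPAIR TABLE {table}"


# Hypothetical table name and partition values, one statement per day written.
statements = [
    add_partition_sql("orders", {"orderdate": day})
    for day in ["2018-03-18", "2018-03-19"]
]
for stmt in statements:
    print(stmt)
print(repair_sql("orders"))
```

Adding partitions explicitly touches only the days the job wrote, while MSCK REPAIR TABLE rescans the whole table location, which can be slow on tables with many partitions.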


answered Oct 04 '22 15:10

wypul

kalyan chakravarthy
