 

Does Spark lock the file while writing to HDFS or S3?

I have an S3 location with the directory structure below, with a Hive table created on top of it:

s3://<Mybucket>/<Table Name>/<day Partition>

Let's say I have a Spark program that writes data into the above table location, spanning multiple partitions, using the line of code below:

df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")

If another program, such as a Hive SQL query or an AWS Athena query, starts reading data from the table at the same time:

Do they read the temporary files that are still being written?

Does Spark lock the data files while writing to the S3 location?

How can we handle such concurrency situations when using Spark as an ETL tool?

asked Mar 19 '18 by kalyan chakravarthy

People also ask

Does Spark store data in HDFS?

Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.

Does Spark Write to HDFS?

Though Spark supports reading from and writing to files on multiple file systems such as Amazon S3, Hadoop HDFS, Azure, and GCP, HDFS is the most commonly used at the time of writing this article. As with any other file system, we can read and write text, CSV, Avro, Parquet, and JSON files in HDFS.
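A minimal PySpark sketch of reading from and writing to HDFS (the paths are hypothetical, and it assumes a reachable HDFS namenode):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io").getOrCreate()

# Read a CSV file from HDFS (hypothetical path)
df = spark.read.option("header", True).csv("hdfs:///data/in/orders.csv")

# Write it back out to HDFS as Parquet
df.write.mode("overwrite").parquet("hdfs:///data/out/orders")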

How does Spark work with HDFS?

From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon's S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.

Can Spark write to S3?

Using the df.write.parquet() function, we can write a Spark DataFrame as Parquet files to Amazon S3. The parquet() function is provided by the DataFrameWriter class.
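For example, a minimal PySpark sketch (the bucket name and data are hypothetical, and it assumes the S3A connector and AWS credentials are already configured on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-write").getOrCreate()

df = spark.createDataFrame(
    [("o1", "2018-03-19"), ("o2", "2018-03-20")],
    ["order_id", "orderdate"],
)

# DataFrameWriter.parquet() writes the DataFrame as Parquet files;
# with the s3a:// connector this targets Amazon S3.
df.write.partitionBy("orderdate").parquet("s3a://my-bucket/orders/")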


1 Answer

Spark writes the output in a two-step process. First, it writes the data to a _temporary directory; then, once the write operation is complete and successful, it moves the files to the output directory.
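A runnable sketch against a local path (the path and sample data are made up) that illustrates the two steps with the default file output committer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-step-write").getOrCreate()

df = spark.createDataFrame(
    [("o1", "2018-03-19"), ("o2", "2018-03-20")],
    ["order_id", "orderdate"],
)

# While this runs, task output is staged under
# /tmp/orders_demo/_temporary/... and is renamed into place
# only when the job commits.
df.write.mode("overwrite").partitionBy("orderdate").parquet("/tmp/orders_demo")

# After a successful commit the directory holds only the final
# partition directories plus a _SUCCESS marker, e.g.:
#   /tmp/orders_demo/_SUCCESS
#   /tmp/orders_demo/orderdate=2018-03-19/part-....parquet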

Do they read the temporary files that are still being written?

No. Files and directories whose names start with _ are treated as hidden, so Hive and AWS Athena do not read them.

Does Spark lock the data files while writing to the S3 location?

No. Locking or any other concurrency mechanism is not required, because Spark's two-step write process ensures that readers never see partially written files.

How can we handle such concurrency situations when using Spark as an ETL tool?

Again, the write-to-temporary-then-move mechanism handles this: readers only ever see files that have been fully written and committed.

One more thing to note: in your example above, after writing the output to the output directory you need to add the new partitions to the Hive external table using ALTER TABLE <tbl_name> ADD PARTITION (...) or MSCK REPAIR TABLE <tbl_name>; otherwise the data won't be available in Hive. A sketch of doing this from Spark follows.
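A minimal sketch of registering the partitions from Spark itself (the database and table names are hypothetical; it assumes the SparkSession is built with Hive support so it can reach the Hive metastore):

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark talk to the Hive metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Discover and register every partition present under the table location:
spark.sql("MSCK REPAIR TABLE mydb.orders")

# Or add one known partition explicitly:
spark.sql("ALTER TABLE mydb.orders ADD IF NOT EXISTS PARTITION (orderdate = '2018-03-19')")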

answered Oct 04 '22 by wypul