I have an S3 location with the below directory structure with a Hive table created on top of it:
s3://<Mybucket>/<Table Name>/<day Partition>
Let's say I have a Spark program which writes data into above table location spanning multiple partitions using the below line of code:
Df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")
If another program such as "Hive SQL query" or "AWS Athena Query" started reading data from the table at the same time:
Do they consider temporary files being written?
Does spark lock the data file while writing into S3 location?
How can we handle such concurrency situations using Spark as an ETL tool?
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
Though Spark supports to read from/write to files on multiple file systems like Amazon S3 , Hadoop HDFS , Azure , GCP e.t.c, the HDFS file system is mostly used at the time of writing this article. Also, like any other file system, we can read and write TEXT, CSV, Avro, Parquet and JSON files into HDFS.
From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon's S3. As such, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.
Using spark. write. parquet() function we can write Spark DataFrame in Parquet file to Amazon S3. The parquet() function is provided in DataFrameWriter class.
Spark writes the output in a two-step process. First, it writes the data to _temporary
directory and then once the write operation is complete and successful, it moves the file to the output directory.
Do they consider temporary files being written?
As the files starting with _
are hidden files, you can not read them from Hive or AWS Athena.
Does spark lock the data file while writing into S3 location?
Locking or any concurrency mechanism is not required because of the simple two-step write process of spark.
How can we handle such concurrency situations using Spark as an ETL tool?
Again using the simple writing to temporary location mechanism.
One more thing to note here is, in your example above after writing output to the output directory you need to add the partition to hive external table using Alter table <tbl_name> add partition (...)
command or msck repair table tbl_name
command else data won't be available in hive.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With