How to efficiently update Impala tables whose files are modified very frequently

Tags:

hadoop

cloudera-cdh

impala

spark-structured-streaming

We have a Hadoop-based solution (CDH 5.15) where new files keep arriving in some HDFS directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files to HDFS is Spark Structured Streaming (2.3.1).

Right now, we are running some DDL queries as soon as we get the files written to HDFS:

- ALTER TABLE table1 RECOVER PARTITIONS, to detect new partitions (and their HDFS directories and files) added to the table.
- REFRESH table1 PARTITION (partition1=X, partition2=Y), using all the keys for each partition.
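For reference, this is roughly how the two statements are built. It is only a minimal sketch in Python: the helper names are hypothetical, partition values are assumed not to contain quote characters, and actually sending the statements to Impala (through whatever JDBC/ODBC client is in use) is left out.

```python
def recover_partitions_sql(table):
    """Statement that makes Impala pick up partition directories it has not seen yet."""
    return f"ALTER TABLE {table} RECOVER PARTITIONS"


def _literal(value):
    # Numbers are emitted as-is; anything else is single-quoted.
    # Assumption: values never contain a quote character.
    if isinstance(value, (int, float)):
        return str(value)
    return f"'{value}'"


def refresh_partition_sql(table, partition):
    """Statement that reloads file metadata for one fully specified partition.

    `partition` maps every partition key column to its value.
    """
    spec = ", ".join(f"{key}={_literal(value)}" for key, value in partition.items())
    return f"REFRESH {table} PARTITION ({spec})"
```

Each call to `refresh_partition_sql` covers exactly one partition, which is why the number of statements grows with the number of partitions touched.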

These DDL statements are taking too long and are getting queued in our system, which hurts the data availability of the system.

So, my question is: Is there a way to do this data incorporation more efficiently?

We have considered:

- Using ALTER TABLE .. RECOVER PARTITIONS, but as per the documentation, it only refreshes new partitions.
- Using REFRESH .. PARTITION ... with multiple partitions at once, but the statement syntax does not allow that.
- Batching the queries, but the Hive JDBC driver does not support batching queries.
- Shall we try to do those updates in parallel, given that the system is already busy?
- Any other way you are aware of?

Thanks!

Victor

Note: We know which partitions need to be refreshed by using HDFS events, since with Spark Structured Streaming we don't know exactly when the files are written.

Note #2: Also, the files written to HDFS are sometimes small, so it would be great if those files could be merged at the same time.


asked Feb 06 '20 08:02

Victor

1 Answer

Since nobody seems to have the answer to my problem, I would like to share the approach we took to make this processing more efficient. Comments are very welcome.

We discovered (the documentation is not very clear on this) that some of the information stored in the Spark "checkpoints" in HDFS is a number of metadata files describing when each Parquet file was written and how big it was:

$ hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
-rw-r--r--   3 hdfs 68K   2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
-rw-r--r--   3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
-rw-r--r--   3 hdfs 68K   2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...

$ hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
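Parsing one of those files is straightforward. The following is a minimal sketch in plain Python, not the actual job code: the function name is hypothetical, and it assumes the layout shown above (a version line followed by one JSON object per line) and keeps only entries whose action is "add", the only action visible in the sample.

```python
import json


def parse_metadata_log(text):
    """Return (path, size) for every "add" entry in one _spark_metadata file.

    Assumes a version line ("v1") followed by one JSON object per line.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skips the version header and blank lines
        record = json.loads(line)
        # Keep data files that were added; ignore directories and other actions.
        if record.get("action") == "add" and not record.get("isDir", False):
            entries.append((record["path"], record["size"]))
    return entries
```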

So, what we did was:

- Build a Spark Streaming job polling that _spark_metadata folder.
  - We use a fileStream since it allows us to define the file filter to use.
  - Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
- Group the files by the parent folder they belong to (which maps to an Impala partition).
- For each folder:
  - Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files).
  - Calculate how many blocks to write (using the size field in the JSON and a target block size).
  - Coalesce the dataframe to the desired number of partitions and write it back to HDFS.
  - Execute the DDL REFRESH myTable PARTITION ([partition keys derived from the new folder]).
- Finally, delete the source files.
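The grouping and sizing steps can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the real job: all function names are hypothetical, the entries are (path, size) pairs taken from the metadata files, and the partition folders are assumed to follow the Hive-style key=value naming (values that are escaped in the path are not decoded here).

```python
import math
import posixpath
from collections import defaultdict


def group_by_partition_dir(entries):
    """Map each parent folder to the (path, size) entries it contains."""
    groups = defaultdict(list)
    for path, size in entries:
        groups[posixpath.dirname(path)].append((path, size))
    return dict(groups)


def target_file_count(entries, target_bytes):
    """Number of output files so that each one is roughly target_bytes or less."""
    total = sum(size for _, size in entries)
    return max(1, math.ceil(total / target_bytes))


def partition_keys(folder):
    """Extract the key=value path segments of a folder as a dict of strings.

    Assumption: Hive-style directories such as ".../year=2020/month=02".
    """
    keys = {}
    for segment in folder.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys
```

The result of `target_file_count` is what gets passed to the coalesce step, and the dict from `partition_keys` supplies the keys for the REFRESH statement, so each folder produces exactly one refresh per batch. The file count is only an estimate, since the rewritten files may compress differently from the originals.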

What we achieved is:

- We limit the DDLs by doing one refresh per partition and batch.
- With batch time and block size configurable, we can adapt our product to different deployment scenarios with bigger or smaller datasets.
- The solution is quite flexible, since we can assign more or fewer resources to the Spark Streaming job (executors, cores, memory, etc.) and we can also start and stop it (using its own checkpointing system).
- We are also studying the possibility of repartitioning the data during this process, to keep partitions as close as possible to the optimal size.

answered Sep 27 '22 15:09

Victor
