atomic hadoop fs move

While building the infrastructure for one of my current projects, I ran into the problem of replacing existing HDFS files. More precisely, I want to do the following:

We have a few machines (log-servers) that continuously generate logs. A dedicated machine (log-preprocessor) receives log chunks from the log-servers (each chunk covers about 30 minutes and is 500-800 MB in size), preprocesses them, and uploads them to the HDFS of our Hadoop cluster.

Preprocessing is done in 3 steps:

1. For each log-server: filter the received log chunk, in parallel (the output file is about 60-80 MB).
2. Combine (merge-sort) all output files from step 1 and do some minor filtering (additionally, 30-minute files are combined into 1-hour files).
3. Using the current mapping from an external DB, process the file from step 2 to obtain the final logfile and put this file into HDFS.

Final logfiles are used as input for several periodic Hadoop applications running on the Hadoop cluster. In HDFS, logfiles are stored as follows:

hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log

Problem description:

The mapping used in step 3 changes over time, and we need to reflect these changes by recalculating step 3 and replacing the old HDFS files with new ones. This update is performed periodically (e.g. every 10-15 minutes), at least for the last 12 hours. Please note that if the mapping has changed, the result of applying step 3 to the same input file may be significantly different (it will not be just a superset/subset of the previous result). So we need to overwrite existing files in HDFS.

However, we can't just do hadoop fs -rm and then hadoop fs -copyFromLocal, because if some Hadoop application is using a file that has been temporarily removed, the app may fail. The solution I use is to put the new file next to the old one; the files have the same name but different suffixes denoting the file's version. Now the layout is the following:

hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2

At startup (setup), every Hadoop application chooses the most up-to-date version of each file and works with those. So even while an update is in progress, the application will not run into problems, because no input file is removed.
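To make the selection step concrete, here is a minimal Python sketch of picking the newest version of each logfile from a directory listing. It assumes the `<name>.log.v<N>` naming shown in the layout above; the function name `latest_versions` is made up for illustration and is not part of any Hadoop API. Versions are compared as integers, so `v10` ranks above `v9`.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed naming scheme, taken from the listing above: <base>.log.v<N>
_VERSIONED = re.compile(r"^(?P<base>.+\.log)\.v(?P<version>\d+)$")


def latest_versions(paths: Iterable[str]) -> List[str]:
    """Return, for each base logfile name, the path with the highest version."""
    best: Dict[str, Tuple[int, str]] = {}
    for path in paths:
        match = _VERSIONED.match(path)
        if match is None:
            continue  # skip anything that is not a versioned logfile
        base = match.group("base")
        version = int(match.group("version"))  # numeric, not lexicographic
        if base not in best or version > best[base][0]:
            best[base] = (version, path)
    # Sort by path so the result order is deterministic.
    return [path for _, path in sorted(best.values(), key=lambda item: item[1])]
```

Names that do not match the pattern (for example, a file still being uploaded under a temporary name) are ignored, so a consumer built this way never picks up a file without a version suffix.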

Questions:

1. Do you know of an easier approach to this problem that does not use this complicated/ugly file versioning?
2. Some applications may start using an HDFS file that is still being uploaded (applications see the file in HDFS but don't know whether it is complete). For gzip files this may lead to failed mappers. Could you please advise how I could handle this issue? I know that for local file systems I can do something like:

cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile

This works because mv within a single file system is an atomic operation; however, I'm not sure that this is the case for HDFS. Could you please advise whether HDFS has an atomic operation like mv in conventional local file systems?
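For reference, the local copy-then-rename pattern can be sketched in Python as follows. This is only an illustration of the local file system case, not of HDFS; the function name `publish_atomically` is hypothetical. It relies on `os.replace`, which performs an atomic rename on POSIX file systems as long as source and destination are on the same file system, which is why the temporary file is created in the destination directory.

```python
import os
import shutil
import tempfile


def publish_atomically(src: str, dst: str) -> None:
    """Copy src next to dst under a temporary name, then rename it into place."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # The temporary file must live on the same file system as dst,
    # otherwise the final rename is not a single atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src, "rb") as infile:
            shutil.copyfileobj(infile, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        os.replace(tmp_path, dst)  # readers see either the old or the new file
    except BaseException:
        # Clean up the partial temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens `dst` at any moment gets either the complete old content or the complete new content, never a partially written file.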

Thanks in advance!

665

asked Sep 26 '12 21:09

Mikhail Shevelev

1 Answer

IMO, the file rename approach is absolutely fine to go with.

HDFS, up to 1.x, lacks atomic renames (they are dirty updates, IIRC), but the operation has usually been considered 'atomic-like' and has never caused problems in the specific scenario you have in mind here. You can rely on it without worrying about a partial state, since the source file is already created and closed.

From HDFS 2.x onwards, proper atomic renames are supported (via a new API call), replacing the earlier version's dirty one. This is also the default behavior of rename if you use the FileContext APIs.
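As an illustration of the rename approach, here is a minimal Python sketch that drives the standard `hadoop fs -put` and `hadoop fs -mv` shell commands: the file is uploaded under a temporary name and only renamed to its final name once the upload has finished. The function name `upload_then_rename`, the `_tmp.` prefix, and the injectable `run` parameter are assumptions made for this example. The underscore prefix is chosen because, as far as I know, Hadoop's FileInputFormat skips names starting with `_` or `.`; an application with its own file-selection logic would need to skip such names itself.

```python
import posixpath
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # Raises CalledProcessError if the hadoop command exits with an error.
    subprocess.run(list(cmd), check=True)


def upload_then_rename(local_path: str, hdfs_path: str,
                       run: Runner = _run) -> List[List[str]]:
    """Upload under a temporary name, then rename to the final name."""
    directory, name = posixpath.split(hdfs_path)
    # Hypothetical temporary name in the same HDFS directory as the target.
    tmp_path = posixpath.join(directory, "_tmp." + name)
    commands = [
        ["hadoop", "fs", "-put", local_path, tmp_path],  # slow: copies data
        ["hadoop", "fs", "-mv", tmp_path, hdfs_path],    # fast: rename only
    ]
    for cmd in commands:
        run(cmd)  # the rename runs only if the upload succeeded
    return commands
```

Two caveats under the same assumptions: to my knowledge `hadoop fs -mv` does not overwrite an existing destination, so this fits the versioned layout from the question, where each upload targets a new `.vN` name; and if the upload fails, the temporary file is left behind and has to be cleaned up separately.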

182

answered Sep 28 '22 05:09

Harsh J

Tags:

atomic

hadoop

hdfs

infrastructure