While building an infrastructure for one of my current projects I've faced the problem of replacement of already existing HDFS files. More precisely, I want to do the following:
We have a few machines (log-servers) which are continuously generating logs. We have a dedicated machine (log-preprocessor) which is responsible for receiving log chunks (each chunk is about 30 minutes in length and 500-800 mb in size) from log-servers, preprocessing them and uploading to HDFS of our Hadoop-cluster.
Preprocessing is done in 3 steps:
Final logfiles are to be used as input for several periodoc HADOOP-applications which are running on a HADOOP-cluster. In HDFS logfiles are stored as follows:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
The mapping which is used on step 3 changes over time and we need to reflect these changes by recalculating step3 and replacing old HDFS files with new ones. This update is performed with some periodicity (e.g. every 10-15 minutes) at least for last 12 hours. Please note that, if the mapping has changed, the result of applying step3 on the same input file may be significantly different (it will not be just a superset/subset of previous result). So we need to overwrite existing files in HDFS.
However, we can't just do hadoop fs -rm and then hadoop fs -copyToLocal because if some HADOOP-application is using the file which is temporary removed the app may fail. The solution I use -- put a new file near the old one, the files have the same name but different suffixes denoting files` version. Now the layout is the following:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
Any Hadoop-application during it's start (setup) chooses the files with the most up-to-date versions and works with them. So even if some update is going on, the application will not experience any problems because no input file is removed.
Do you know some easier approach to this problem which does not use this complicated/ugly file versioning?
Some applications may start using a HDFS-file which is currently uploading, but not yet uploaded (applications see this file in HDFS but don't know if it consistent). In case of gzip files this may lead to failed mappers. Could you please advice how could I handle this issue? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/output.tmp /finaldir/output
This works because mv is an atomic operation, however I'm not sure that this is the case for HDFS. Could you please advice if HDFS has some atomic operation like mv in conventional local file systems?
Thanks in advance!
You can use the cp command in Hadoop. This command is similar to the Linux cp command, and it is used for copying files from one directory to another directory within the HDFS file system.
to do this on continuous basis you either need to setup an Oozie job that will run a script to determine old data and move it to new location. Alternatively, you can use Apache Nifi by watching a directory for old data and move it to new location. There's nothing out of the box that will do that for you.
Although HDFS offers atomic recursive directory deletion, none of the other Hadoop FileSystems offer such a guarantee (including local FileSystems).
There is no cd (change directory) command in hdfs file system. You can only list the directories and use them for reaching the next directory. You have to navigate manually by providing the complete path using the ls command.
IMO, the file rename approach is absolutely fine to go with.
HDFS, upto 1.x, lacks atomic renames (they are dirty updates IIRC) - but the operation has usually been considered 'atomic-like' and never given problems to the specific scenario you have in mind here. You could rely on this without worrying about a partial state since the source file is already created and closed.
HDFS 2.x onwards supports proper atomic renames (via a new API call) that has replaced the earlier version's dirty one. It is also the default behavior of rename if you use the FileContext APIs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With