
Copy files to local from multiple directories in HDFS for last 24 hours

Tags:

bash

hadoop

hdfs

I have a problem getting data from HDFS to the local file system. For example, I have:

/path/to/folder/report1/report1_2019_03_24-03_10*.csv
/path/to/folder/report1/report1_2019_03_24-04_12*.csv
...
/path/to/folder/report1/report1_2019_03_25-05_12*.csv
/path/to/folder/report1/report1_2019_03_25-06_12*.csv
/path/to/folder/report1/report1_2019_03_25-07_11*.csv
/path/to/folder/report1/report1_2019_03_25-08_13*.csv
/path/to/folder/report2/report2_out_2019_03_25-05_12*.csv
/path/to/folder/report2/report2_out_2019_03_25-06_11*.csv
/path/to/folder/report3/report3_TH_2019_03_25-05_12*.csv

So I need to enter each of these folders (report1, report2, report3... but not all of them start with "report") and copy the CSV files from the previous 24 hours to local, and that should be done each morning at 4 am (I can schedule that with crontab). The problem is that I don't know how to iterate over the files and pass a timestamp as an argument.

I have tried something like this (found on Stack Overflow):

/datalake/hadoop/bin/hadoop fs -ls /path/to/folder/report1/report1/* \
    | tr -s " " \
    | cut -d' ' -f6-8 \
    | grep "^[0-9]" \
    | awk 'BEGIN{ MIN=1440; LAST=60*MIN; "date +%s" | getline NOW }
           { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN;
             DIFF=NOW-WHEN;
             if(NOW > DIFF){ print "Migrating: "$3;
                             system("datalake/hadoop/bin/hadoop fs -copyToLocal /path/to/local_dir/"$3) } }'

But this one copies files that are older than a few days, and it only copies files from one directory (in this case report1).

Is there any way to make this more flexible and correct? It would be great if this could be solved with bash rather than Python. Any suggestion is welcome, as is a link to a good answer to a similar problem.

Also, it doesn't have to be a loop. It's OK for me to use a separate line of code for each report.

asked Mar 26 '19 at 18:03 by jovicbg


People also ask

How do I copy multiple files from HDFS to local?

The Hadoop get command is used to copy files from HDFS to the local file system. Use hadoop fs -get or hdfs dfs -get, specifying the HDFS file path you want to copy from, followed by the local file path you want to copy to.
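For example, with a hypothetical HDFS path and local target directory:

$ hadoop fs -get /user/data/report1.csv /tmp/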

How do I copy files from one folder to another in HDFS?

You can use the cp command in Hadoop. This command is similar to the Linux cp command, and it is used for copying files from one directory to another directory within the HDFS file system.
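For instance, copying within HDFS (hypothetical paths):

$ hadoop fs -cp /user/data/report1.csv /user/archive/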

Which command of HDFS helps copy files from HDFS to local file system?

copyToLocal: copies files from HDFS to the local file system, similar to the -get command.
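A sample invocation (hypothetical paths):

$ hdfs dfs -copyToLocal /user/data/report1.csv /tmp/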

How do I copy files from local file system to Hadoop HDFS?

To copy a file from the local file system to HDFS, use hadoop fs -put or hdfs dfs -put, specifying the local file path you want to copy from and then the HDFS file path you want to copy to. If the file already exists on HDFS, you will get an error message saying "File already exists".
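For example (hypothetical paths):

$ hadoop fs -put /tmp/report1.csv /user/data/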


1 Answer

Note: I was unable to test this, but you can verify it step by step by looking at the output:

Normally I would say Never parse the output of ls, but with Hadoop, you don't have a choice here as there is no equivalent to find. (Since 2.7.0 there is a find, but it is very limited according to the documentation)
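For reference, the built-in find only matches on file names (expressions such as -name and -iname), so it cannot filter on modification time, which is why it does not help here:

$ hadoop fs -find /path/to/folder/ -name '*.csv' -print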

Step 1: recursive ls

$ hadoop fs -ls -R /path/to/folder/

Step 2: use awk to pick only regular files and only CSV files
Directories are recognized by their permissions starting with d, so we have to exclude those. CSV files are recognized by the last field ending in ".csv":

$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/ && /\.csv$/'

Make sure you do not end up with odd lines here which are empty or contain just the directory name.

Step 3: continue using awk to process the time. I am assuming you have a standard awk, so I will not use GNU extensions. Hadoop will output the time format as yyyy-MM-dd HH:mm. This format can be sorted and is located in fields 6 and 7:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff)'
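For example, if this runs at 2019-03-26 04:00, the cutoff string would look like this (illustrative output):

$ date -d '-24 hours' '+%F %H:%M'
2019-03-25 04:00

The string comparison ($6" "$7) > cutoff works because the yyyy-MM-dd HH:mm format sorts lexicographically in chronological order.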

Step 4: Copy files one by one:

First, check the command you are going to execute:

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
            print "migrating", $NF
            cmd="hadoop fs -get "$NF" /path/to/local/"
            print cmd
            # system(cmd)
         }'

(remove # if you want to execute)

or

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) {
            print $NF
         }' | xargs -I{} echo hadoop fs -get '{}' /path/to/local/

(remove echo if you want to execute)
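To run this every morning at 4 am, as the question requires, one option is to wrap the command in a small script and schedule it with cron. A minimal sketch; the script name, local target directory and log path are hypothetical placeholders:

#!/bin/bash
# copy_reports.sh (hypothetical name): copy CSV files modified in the
# last 24 hours from HDFS to the local target directory
hadoop fs -ls -R /path/to/folder/ \
   | awk -v cutoff="$(date -d '-24 hours' '+%F %H:%M')" \
         '(!/^d/) && /\.csv$/ && (($6" "$7) > cutoff) { print $NF }' \
   | xargs -I{} hadoop fs -get '{}' /path/to/local/

And the corresponding crontab entry:

0 4 * * * /path/to/copy_reports.sh >> /path/to/copy_reports.log 2>&1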

answered Nov 11 '22 at 16:11 by kvantour