I'm trying to search through HDFS for parquet files and list them out. I'm using this, which works great. It looks through all of the subdirectories in /sources.works_dbo and gives me all the parquet files:
hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"
However; I just want to return the first file it encounters per subdirectory, so that each subdirectory only appears on a single line in my output. Say I had this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
When I run my command I expect the output to look like this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
... | awk '!seen[gensub(/[^/]+$/,"",1)]++' file
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
The above uses GNU awk for gensub(), with other awks you'd use a variable and sub():
awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'
It will work for any mixture of any length of paths.
You can use sort -u (unique) with / as the delimiter and using the first three fields as key. The -s option ("stable") makes sure that the file retained is the first one encountered for each subdirectory.
For this input
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
the result is
$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With