Recursive search grep

Question

I'm trying to search HDFS for parquet files and list them. I'm using the command below, which works great: it looks through all of the subdirectories in /sources/works_dbo and gives me all the parquet files:

 hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"

However, I only want the first file it encounters in each subdirectory, so that each subdirectory appears on just one line of my output. Say I had this:

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet

When I run my command I expect the output to look like this:

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

Ed Morton · Accepted Answer

... | awk '!seen[gensub(/[^/]+$/,"",1)]++'
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

The above uses GNU awk for gensub(); with other awks you'd use a variable and sub():

awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'

It works for any mixture of path depths, because the key is everything up to the last / rather than a fixed number of path components.
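One thing to watch for: the raw output of hdfs dfs -ls -R normally carries permission, owner, size and date columns in front of each path, so keying on the whole line would treat files in the same directory as different. Here is a minimal sketch that keys on the path only. It assumes the path is the last whitespace-separated field (so no spaces in paths); the function names and the sample listing are made up for illustration.

```shell
# Reads `hdfs dfs -ls -R`-style lines on stdin and prints the first
# .parquet path seen in each directory.
# Assumption: the path is the last whitespace-separated field.
first_parquet_per_dir() {
    awk '$NF ~ /\.parquet$/ {
        dir = $NF
        sub(/[^/]+$/, "", dir)        # drop the file name, keep the directory
        if (!seen[dir]++) print $NF   # print only the first hit per directory
    }'
}

# Made-up sample listing standing in for real HDFS output.
sample_listing() {
    printf '%s\n' \
        'drwxr-xr-x   - hdfs hdfs          0 2020-01-01 10:00 /sources/works_dbo/test1' \
        '-rw-r--r--   3 hdfs hdfs       1024 2020-01-01 10:01 /sources/works_dbo/test1/file1.parquet' \
        '-rw-r--r--   3 hdfs hdfs       2048 2020-01-01 10:02 /sources/works_dbo/test1/file2.parquet' \
        'drwxr-xr-x   - hdfs hdfs          0 2020-01-01 10:03 /sources/works_dbo/test2' \
        '-rw-r--r--   3 hdfs hdfs       4096 2020-01-01 10:04 /sources/works_dbo/test2/file3.parquet'
}

sample_listing | first_parquet_per_dir
```

Against a real cluster you would presumably pipe the listing in directly, e.g. hdfs dfs -ls -R /sources/works_dbo | first_parquet_per_dir, which also makes the separate grep unnecessary.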

Benjamin W. · Answer

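You can use sort -u (unique) with / as the delimiter and the first three fields as the key. The -s option ("stable") makes sure that the file retained is the first one encountered for each subdirectory. This assumes every file sits at the same depth, so that the directory is always the first three fields.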

For this input

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet

the result is

$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
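The field count in the key depends on how the paths are written. As a rough sketch (the helper name first_per_prefix and its depth argument are hypothetical, not part of sort), the same command can be wrapped so the number of directory fields is a parameter. Note that an absolute path starts with /, which gives an empty first field, so the directory then spans four fields instead of three.

```shell
# Keep the first line for each distinct directory prefix.
# $1 = number of "/"-separated fields that make up the directory
#      (hypothetical parameter, chosen by the caller).
first_per_prefix() {
    sort -s -t '/' -k "1,$1" -u
}

# Relative paths: the directory is fields 1-3.
printf '%s\n' \
    'sources/works_dbo/test1/file1.parquet' \
    'sources/works_dbo/test1/file2.parquet' \
    'sources/works_dbo/test2/file3.parquet' |
    first_per_prefix 3

# Absolute paths: the leading "/" adds an empty first field,
# so the directory is fields 1-4.
printf '%s\n' \
    '/sources/works_dbo/test1/file1.parquet' \
    '/sources/works_dbo/test1/file2.parquet' \
    '/sources/works_dbo/test2/file3.parquet' |
    first_per_prefix 4
```

Unlike the awk approach, this reorders the output by key and only works when all files sit at the same depth.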

Tags:

linux

grep

hdfs

jymbo
