
Recursive search grep

Tags: linux, grep, hdfs

I'm trying to search HDFS for parquet files and list them. The command below works well: it walks every subdirectory of /sources/works_dbo and returns all the parquet files:

 hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"

However, I only want the first file it encounters in each subdirectory, so that each subdirectory appears on just one line of the output. Say I had this:

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet

When I run my command I expect the output to look like this:

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

jymbo asked Jun 13 '26 06:06


2 Answers

... | awk '!seen[gensub(/[^/]+$/,"",1)]++'
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

The above uses GNU awk for gensub(); with other awks you'd use a variable and sub():

awk '{path=$0; sub(/[^/]+$/,"",path)} !seen[path]++'

Both versions key on everything up to the last slash, so they work for paths of any depth, including a mixture of depths in the same input.
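
One caveat when piping real `hdfs dfs -ls -R` output: each line starts with permissions, owner, size and timestamp columns, so a key built from the whole line would differ for every file. The sketch below is one way to handle that, under the assumption that the path is the last whitespace-separated field and contains no spaces. The function name `first_per_dir` is made up for illustration.

```shell
# Keep the first path seen for each parent directory.
# Reads either bare paths or ls-style lines (path in the last field) on stdin.
first_per_dir() {
    awk '{
        path = $NF                  # assumption: path is the last field, no spaces in it
        dir = path
        sub(/[^/]+$/, "", dir)      # drop the file name, keep "dir/" as the key
        if (!seen[dir]++) print path
    }'
}

# Hypothetical usage with the listing from the question:
# hdfs dfs -ls -R /sources/works_dbo | grep '\.parquet$' | first_per_dir
```

Because only the path is printed, the output matches the one-path-per-line format shown in the question.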

Ed Morton answered Jun 16 '26 06:06


You can use sort -u (unique) with / as the field delimiter and the first three fields as the key. The -s option ("stable") makes sure that the file retained is the first one encountered for each subdirectory.

For this input

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet

the result is

$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
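
Note that the key range depends on how many `/`-separated fields make up the directory. An absolute path such as `/sources/works_dbo/test1/file1.parquet` has an empty first field before the leading slash, so `-k 1,3` would cover only `/sources/works_dbo` and collapse everything to one line. A small sketch with the depth as a parameter (the wrapper name `first_per_depth` is hypothetical):

```shell
# Keep one line per directory, where the directory is the first N "/"-separated fields.
# $1 = N (3 for the relative paths above, 4 for the same paths with a leading slash)
first_per_depth() {
    sort -s -t '/' -k "1,$1" -u
}

# Absolute paths: the empty field before the leading slash counts as field 1.
printf '%s\n' \
    '/sources/works_dbo/test1/file1.parquet' \
    '/sources/works_dbo/test1/file2.parquet' \
    '/sources/works_dbo/test2/file3.parquet' | first_per_depth 4
```

Unlike the awk approach, this reorders the output by key and only works when every directory sits at the same depth.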

Benjamin W. answered Jun 16 '26 06:06


