I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.
I can see the files I wish to search like this:
bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time
...which returns several entries like this:
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r-- 3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab
How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.
This is a Hadoop "filesystem", not a POSIX one, so you can't grep the files in place; instead, stream each file's contents through grep:
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  # -q suppresses grep's output; we only need its exit status
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done
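One caveat: if any of these files are compressed or stored as SequenceFiles, -cat emits raw bytes and grep will miss the string; hadoop fs -text decodes such files first. A minimal sketch of the same loop with that substitution, assuming the same directory:

hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  # -text decompresses/decodes where -cat would print raw bytes
  hadoop fs -text "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done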
This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
Notice the -P 10 option to xargs: this is how many files we download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is the bottleneck in your configuration.
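If you'd rather pick that number empirically than guess, you can time a run at each level of parallelism. A rough sketch; the 2/5/10/20 ladder is just an illustrative assumption:

# time the parallel search at increasing levels of parallelism
for p in 2 5 10 20
do
  echo "== -P $p =="
  time hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
    xargs -n 1 -I ^ -P "$p" bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"
done

Bash's time keyword times the entire pipeline, so each iteration reports the end-to-end wall clock for that -P value.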
EDIT: Given that you're on SunOS (which is slightly brain-dead), its stock grep doesn't support -q, so redirect to /dev/null and rely on the exit status instead:
hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | while read f; do hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"; done
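Alternatively, if the XPG4 utilities are installed on your Solaris box (conventionally under /usr/xpg4/bin; treat that path as an assumption about your system), that grep does understand -q, so the original loop works with it:

hadoop fs -ls /apps/mdhi-technology/b_dps/real-time | awk '{print $8}' | \
while read f
do
  # POSIX-compliant XPG4 grep supports -q (path is an assumption)
  hadoop fs -cat "$f" | /usr/xpg4/bin/grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done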