I came across this question in an interview:
Write a shell script to show the frequency of each word in a file and in a directory.
A
  - A1
    - File1.txt
    - File2.txt
  - A2
    - FileA21.txt
  - A3
    - FileA31.txt
    - FileA32.txt
B
  - B1
    - FileB11.txt
    - FileB12.txt
    - FileB13.txt
  - B2
    - FileB21.txt
As I understood the question, A and B are two separate directories, with A1, A2 and A3 being subdirectories of A, and B1 and B2 being subdirectories of B. So I answered like this:
find . '\(-name "A" -and -name "B"\)' -type f -exec cat '{}' \; | awk '{c[$1]++} END {for (i in c) print i, c[i]}'
But I still got feedback that the above script was not good enough. What's wrong with it?
The major limitation is that the script assumes there is exactly one word per line: c[$1]++ just increments the count for the first field of each line.
The question didn't mention anything about the number of words per line, so I'd assume this wasn't the intention: you need to go through each word in a line. Also, what about empty lines? With an empty line, $1 will be the empty string, so your script will end up counting "empty" words (which it will happily show as part of the output).
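To make both problems concrete, here is a small sketch that feeds a made-up three-line input (the second line is empty) to the same awk program as in the question. The function name first_field_counts and the sample words are only for illustration.

```shell
# The awk program from the question, wrapped in a function (hypothetical name).
first_field_counts() {
    awk '{ c[$1]++ } END { for (i in c) print i, c[i] }'
}

# Hypothetical input: "foo" occurs three times, and the second line is empty.
printf 'foo bar foo\n\nbaz foo\n' | first_field_counts | LC_ALL=C sort
```

Here foo is reported only once although it occurs three times, bar is never reported, and the output contains a line for the empty "word" with a count of 1.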
In awk, the number of fields in a line is stored in the built-in variable NF; thus it is easy to write code to loop through the words and increment the corresponding count (and it has the nice side effect of implicitly ignoring lines without words).
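As a quick check of how NF behaves, this sketch prints it for a made-up input with three words on the first line, an empty second line, and one word on the third. The variable name nf_values is just for illustration.

```shell
# Hypothetical input: three words, an empty line, then a single word.
nf_values=$(printf 'a b c\n\nd\n' | awk '{ print NF }')
printf '%s\n' "$nf_values"
```

The empty line yields an NF of 0, so a loop running from 1 to NF never executes for it.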
So, I would do something like this instead:
find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'
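For a quick sanity check, the sketch below runs the same pipeline on a throwaway directory tree. The tree layout, file names and contents are invented for the example, and the final sort is only there to make the output order stable, since awk's for (i in w) iterates in an unspecified order.

```shell
# Build a throwaway tree; directory names, file names and contents are hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/A/A1" "$tmp/B/B1"
printf 'apple banana apple\n' > "$tmp/A/A1/File1.txt"
printf '\ncherry apple\n'     > "$tmp/B/B1/FileB11.txt"

# Same pipeline as above, pointed at the sample tree instead of "."
counts=$(find "$tmp" -type f -exec cat '{}' \; |
    awk '{ for (i = 1; i <= NF; i++) w[$i]++ }
         END { for (i in w) printf("%-10s %10d\n", i, w[i]) }' |
    LC_ALL=C sort)
printf '%s\n' "$counts"

# Clean up the sample tree.
rm -rf "$tmp"
```

With this input, apple is counted three times across both files, banana and cherry once each, and the empty line in the second file contributes nothing.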
I removed the directory name constraints from the arguments to find(1) for the sake of conciseness, and to make it more general.
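If the search does need to be limited to A and B, one option is to pass them to find as starting points rather than filtering with -name. This is only a sketch under the assumption that A and B sit directly in the current directory; the sample tree, the extra directory C and the file contents are made up. It also uses -exec ... + so that cat is invoked on batches of files instead of once per file.

```shell
# Build a throwaway tree with a third directory that should be ignored (all names hypothetical).
tmp=$(mktemp -d)
mkdir -p "$tmp/A/A1" "$tmp/B/B2" "$tmp/C"
printf 'one two\n'   > "$tmp/A/A1/File1.txt"
printf 'two three\n' > "$tmp/B/B2/FileB21.txt"
printf 'ignored\n'   > "$tmp/C/other.txt"

# Only A and B are given as starting points, so files under C are never read.
restricted=$(cd "$tmp" && find A B -type f -exec cat '{}' + |
    awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) print i, w[i] }' |
    LC_ALL=C sort)
printf '%s\n' "$restricted"

# Clean up the sample tree.
rm -rf "$tmp"
```

The word from the file under C does not appear in the result, while two is counted once from each of A and B.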
This is (probably) the main issue with your solution, but the question is (intentionally) vague and there are many things left to discuss:
FWIW, always remember that your success in an interview is not a binary yes/no. It's not like: Oops, you can't do X, so I'm going to reject you. Or: Oops, wrong answer, you're out. More important than the answer is the process that gets you there, and whether or not you are aware of (a) the assumptions you made; and (b) your solution's limitations. The questions above show ability to consider edge cases, ability to clarify assumptions and requirements, etc, which is way more important than getting the "right" script (and probably there's no such thing as The Right Script).