
Shell script to show frequency of each word in file and in a directory

Tags: bash, shell, awk

I came across this question in an interview:

Shell script to show frequency of each word in file and in a directory

A
    - A1
        - File1.txt
        - File2.txt
    - A2
        - FileA21.txt
    - A3
        - FileA31.txt
        - FileA32.txt
B
    - B1
        - FileB11.txt
        - FileB12.txt
        - FileB13.txt
    - B2
        - FileB21.txt

As I understood the question, A and B are two separate directories, with A1, A2 and A3 being sub-directories of A, and B1 and B2 being sub-directories of B. So I answered like this:

find . '\(-name "A" -and -name "B"\)' -type f -exec cat '{}' \; | awk '{c[$1]++} END {for (i in c) print i, c[i]}'

But I got feedback that the above script was not good enough. What's wrong with it?

asked Sep 27 '22 21:09 by user3624000


1 Answer

The major limitation is that the script assumes there is exactly one word per line. c[$1]++ only increments the count for the first field of each line, so every other word on the line is ignored.

The question didn't mention anything about the number of words in a line, so I'd assume this wasn't the intention - you need to go through each word in a line. Also, what about empty lines? With an empty line, $1 will be the empty string, so your script will end up counting "empty" words (which it will happily show as part of the output).
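To make both problems concrete, here is a small sketch that runs the original counting logic on a made-up three-line sample (the function name count_first_field and the sample text are only illustrative, not part of the question). The brackets make the empty "word" visible, and the output is sorted only because awk's for-in order is unspecified.

```shell
# Same counting logic as the script in the question: only $1 is counted.
count_first_field() {
    awk '{ c[$1]++ } END { for (i in c) print "[" i "]", c[i] }' | LC_ALL=C sort
}

# Hypothetical sample: a two-word line, an empty line, a one-word line.
printf 'hello world\n\nhello\n' | count_first_field
# Prints:
# [] 1
# [hello] 2
```

Note that "world" never shows up, while the empty line is reported as a word with a count of 1.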

In awk, the number of fields in a line is stored in the built-in variable NF; thus it is easy to write code to loop through the words and increment the corresponding count (and it has the nice side effect of implicitly ignoring lines without words).

So, I would do something like this instead:

find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'

I removed the directory-name constraints from the arguments to find(1), both for conciseness and to make the command more general.
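If the search does need to be limited to A and B, one possible way is to pass those directories to find as starting points instead of filtering with -name. The sketch below assumes the files of interest end in .txt, as in the listing in the question; the function name word_freq_in_dirs is made up for illustration.

```shell
# Count words in all *.txt files under the given top-level directories.
# "$@" holds the starting directories, e.g. A B (an assumption based on
# the layout shown in the question).
word_freq_in_dirs() {
    find "$@" -type f -name '*.txt' -exec cat {} + |
        awk '{ for (i = 1; i <= NF; i++) w[$i]++ }
             END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'
}

# Usage, from the directory that contains A and B:
#   word_freq_in_dirs A B
```

Using + instead of \; lets find pass many files to a single cat invocation, which avoids starting one process per file.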

This is (probably) the main issue with your solution, but the question is (intentionally) vague and there are many things left to discuss:

  • Is it case-sensitive? This solution treats World and world as different words. Is this desired?
  • What about punctuation? Should hello and hello! be treated as the same word? What about commas? That is, do we need to parse and ignore punctuation?
  • Speaking of which - what about things like what's vs. what? Do we consider them different words? And it's vs. its? English is tricky!
  • Most important of all (and related to the points above), what exactly defines a word? We assumed a word is a sequence of non-blanks (the default in awk). Is this accurate?
  • If there are no words in the input, what do we do? This solution prints nothing - maybe we should print a warning message?
  • Is there a fixed number of words in a line? Or is it arbitrary? (E.g. if there's exactly one word per line, your solution would be enough)
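As a rough illustration of how the answers to these questions change the script, the sketch below picks one possible set of answers: case is ignored, and a word is any run of alphabetic characters, so punctuation is dropped. These are assumptions, not requirements stated in the question, and the function name word_freq_normalized is hypothetical.

```shell
# Reads text on stdin and prints "count word" lines, most frequent first.
# Assumed definition of a word: a run of letters, compared case-insensitively.
word_freq_normalized() {
    tr '[:upper:]' '[:lower:]' |        # fold case: World == world
        tr -cs '[:alpha:]' '\n' |       # every run of non-letters becomes one newline
        awk 'NF { w[$1]++ } END { for (i in w) print w[i], i }' |
        LC_ALL=C sort -k1,1nr -k2,2     # by count descending, then alphabetically
}

printf 'Hello, world! Hello again.\n' | word_freq_normalized
# Prints:
# 2 hello
# 1 again
# 1 world
```

This choice has its own limitations: what's is split into what and s, and digits are discarded, so a different definition of "word" would need a different second tr (or more logic in awk).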

FWIW, always remember that your success in an interview is not a binary yes/no. It's not like: Oops, you can't do X, so I'm going to reject you. Or: Oops, wrong answer, you're out. More important than the answer is the process that gets you there, and whether or not you are aware of (a) the assumptions you made; and (b) your solution's limitations. Asking questions like the ones above shows that you can consider edge cases and clarify assumptions and requirements, which is far more important than getting the "right" script (and there is probably no such thing as The Right Script).

answered Sep 30 '22 06:09 by Filipe Gonçalves