Given a folder whose subfolders contain multilingual .txt files such as:
But where is Esope the holly Bastard
But where is 생 지 옥 이 군
지 옥 이
지 옥
지
我 是 你 的 爸 爸 !
爸 爸 ! ! !
你 不 會 的 !
I already know how to count space-separated word frequencies within ONE file.txt:
$ grep -o '\w*' myfile.txt | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > myoutput.txt
This gives the elegant output:
1 생
1 군
1 Bastard
1 Esope
1 holly
1 the
1 不
1 我
1 是
1 會
2 이
2 But
2 is
2 where
2 你
2 的
3 옥
4 지
4 爸
5 !
How can I change the code to work on multiple files within a folder and its subfolders, all matching a similar pattern (at least *.txt)?
You can use the find command for that, like this:
find . -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort
I'm using the -exec option to cat every *.txt file in the current directory and its subdirectories. The output is then piped into your grep | awk | sort pipeline.
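If you run this often, the same pipeline could be wrapped in a small shell function. The following is only a sketch under a few assumptions: count_words is a hypothetical name, the directory argument defaults to the current directory, -type f is added to skip directories that happen to end in .txt, "{} +" replaces "{} \;" so that cat is started once for many files instead of once per file, and sort -n replaces plain sort so that a count of 10 sorts after 2 rather than before it.

```shell
# Count word frequencies across every *.txt file under a directory.
# count_words is a hypothetical helper name, not part of any tool.
count_words() {
    dir="${1:-.}"   # directory to scan; defaults to the current one
    # -type f skips directories; "{} +" hands many files to a single cat.
    find "$dir" -type f -iname '*.txt' -exec cat {} + |
        grep -o '\w*' |
        awk '{a[$1]++} END {for (k in a) print a[k], k}' |
        sort -n     # numeric sort on the leading count
}
```

It could then be called as count_words path/to/folder > myoutput.txt, where path/to/folder stands for your own folder.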