How would I get the line count for all the different file types in my repository? So for example, if my repository contained 3 file types (java, xml, and files with no extension),
I want the output to be something like this:
java 150
xml 20
(no file extension) 30
I can run a command that will retrieve the line count for a specific file type (git ls-files | grep "\.java$" | xargs cat | wc -l), but assuming I don't know what all the file types are in my repository, how would I go about retrieving them all with their respective line counts?
This is really a Bash question: How to count the number of lines in a list of files grouped by filename extension?
Here's a lazy way using awk:
git ls-files | xargs -n100 wc -l | awk -F ' +|\\.' \
'/\./ { sumlines[$NF] += $2 }
END { for (ext in sumlines) print ext, sumlines[ext] }'
The key points:
git ls-files gives you the list of files in the repository.
xargs takes the list of files from its standard input and runs wc -l on them.
The -n100 flag passes at most 100 files to wc -l in a single call. wc -l will be called roughly as many times as the number of files in the repository divided by 100, rounded up.
awk does the heavy lifting of summing and aggregating the number of lines per filename extension.
-F ' +|\\.' specifies the field separator: spaces or a dot. The idea is that the output of wc -l contains lines starting with spaces, followed by the number of lines, followed by a space, and followed by the filename. By using this as the separator, the 2nd field will be the number of lines, and the last field will be the filename extension. This will be useful for counting and aggregating.
In /\./ { sumlines[$NF] += $2 }, $NF is the value of the last field, in this example the filename extension, and $2 is the number of lines, as mentioned earlier. That is, we sum the number of lines per extension. The /\./ filter excludes lines in the input that don't have a dot. The main reason for this is to exclude the line with the total from the output of wc -l.
The END block prints the filename extensions and their total counts.
It's lazy because it won't work with files whose names contain newline characters, and it doesn't count the lines in files with no extension.
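One possible way around the missing "no extension" count is to take the extension from the base name of each path instead of from the wc -l output. The following is only a sketch, assuming a POSIX shell; the function name count_lines_by_ext and the "(no file extension)" label are made up for illustration, and file names containing newlines are still not handled.

```shell
# Hypothetical helper: reads one path per line on stdin and prints
# "<extension> <total lines>" for every extension it sees.
count_lines_by_ext() {
    while IFS= read -r file; do
        [ -f "$file" ] || continue            # skip anything that is not a regular file
        name=${file##*/}                      # drop the directory part
        case $name in
            ?*.*[!.]) ext=${name##*.} ;;      # text after the last dot
            *)        ext='(no file extension)' ;;
        esac
        # One "extension<TAB>line count" record per file
        printf '%s\t%s\n' "$ext" "$(wc -l < "$file")"
    done | awk -F '\t' '
        { sum[$1] += $2 }
        END { for (ext in sum) print ext, sum[ext] }'
}
```

It would be used as git ls-files | count_lines_by_ext. Running wc -l once per file is slower than the xargs version, but dots in directory names and spaces in file names no longer confuse the grouping.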
Note: After reconsidering this, I really think janos' is the correct answer to the question asked, since it provides the line count per file type, whereas my solution provides the file count.
Using janos' solution gave me the following error (I'm using it on a quite big project):
xargs: wc: Argument list too long
So I came up with the following solution (maybe not the most elegant, but it does the trick even on big projects):
git ls-files | awk -F . '{print $NF}' | sort | uniq -c | sort -n -r | awk '{print $2,$1}' | head -10
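This basically consists of the following steps (modify them to suit your needs):
git ls-files to list the files in the repository
awk to get all the filetypes of the files
sort them
uniq -c to count how often each filetype occurs
sort them reversed (filetype with most occurrences on top)
awk to swap the columns ($1 = count, $2 = filetype)
head to keep only the top 10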
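The same steps can be wrapped in a small function that also groups files without an extension, which the one-liner above reports under their full path. This is a sketch under assumptions, not a drop-in replacement: the name count_files_by_ext and the "(no file extension)" label are invented here, and it still counts files, not lines.

```shell
# Hypothetical helper: reads one path per line on stdin and prints
# "<extension> <number of files>", most frequent extension first.
count_files_by_ext() {
    sed 's|.*/||' |                           # keep only the base name
    awk '{
        # An extension needs at least one character before the last dot
        if (match($0, /.\.[^.]+$/))
            print substr($0, RSTART + 2)
        else
            print "(no file extension)"
    }' |
    sort | uniq -c | sort -n -r |
    awk '{ count = $1; sub(/^ *[0-9]+ /, ""); print $0, count }'
}
```

It would be used as git ls-files | count_files_by_ext | head -10.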