
Get line count for all different file types in a repository

Tags: git, bash

How would I get the line count for all the different file types in my repository? So for example, if my repository contained 3 file types:

  • java
  • xml
  • (no file extension)

I want the output to be something like this:

java 150
xml 20
(no file extension) 30

I can run a command that will retrieve the line count for a specific file type (git ls-files | grep "\.java$" | xargs cat | wc -l), but assuming I don't know what all the file types are in my repository, how would I go about retrieving them all with their respective line counts?

digiarnie asked Nov 05 '13 23:11


2 Answers

This is really a Bash question: How to count the number of lines in a list of files grouped by filename extension?

Here's a lazy way using awk:

git ls-files | xargs -n100 wc -l | awk -F ' +|\\.' \
    '/\./ { sumlines[$NF] += $2 }
     END { for (ext in sumlines) print ext, sumlines[ext] }'

The key points:

  • git ls-files gives you the list of files in the repository.
  • xargs takes the list of files from its standard input and runs wc -l on them.
    • The -n100 flag passes at most 100 files to each wc -l call, so wc -l runs roughly once per 100 files in the repository.
  • awk does the heavy lifting of summing and aggregating the number of lines per filename extension
    • -F ' +|\\.' sets the field separator to one or more spaces, or a dot. The idea is that each line of wc -l output starts with spaces, followed by the line count, a space, and the filename. With this separator the 1st field is empty (it precedes the leading spaces), the 2nd field is the line count, and the last field is the filename extension. Note that this relies on wc padding its counts with leading spaces.
    • In /\./ { sumlines[$NF] += $2 }, $NF is the value of the last field (here the filename extension) and $2 is the line count, as mentioned above. That is, we sum the line counts per extension. The /\./ pattern skips input lines that contain no dot. The main reason for this is to exclude the "total" line from the output of wc -l; it is also why files with no extension are not counted.
    • The END block prints the filename extensions and their total counts
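
To see how the field splitting behaves, here is a small sketch that feeds the same awk program some made-up wc -l output instead of real repository data. The function name sum_wc_by_ext and the sample file names are illustrative assumptions, not part of the answer.

```shell
# Hypothetical wrapper around the awk program from the answer.
# Reads `wc -l`-style lines on stdin and prints "<extension> <total lines>".
sum_wc_by_ext() {
    awk -F ' +|\\.' \
        '/\./ { sumlines[$NF] += $2 }
         END { for (ext in sumlines) print ext, sumlines[ext] }'
}

# Made-up sample input in the padded layout that `wc -l` typically prints.
printf '%s\n' \
    '  100 src/Main.java' \
    '   50 src/Util.java' \
    '   20 pom.xml' \
    '   30 Makefile' \
    '  200 total' |
    sum_wc_by_ext
```

This should report a total of 150 for java and 20 for xml, in no guaranteed order because awk does not specify the iteration order of for (ext in ...). The Makefile and total lines are skipped because they contain no dot.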

It's lazy because it won't work with files whose names contain spaces or newline characters, and it doesn't count the lines in files with no extension.
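
Both limitations can be worked around by passing NUL-delimited paths and classifying each file by its basename. The following is only a sketch of one possible approach, not the original author's method: the function name count_lines_by_ext and the "(no file extension)" label are assumptions, and it relies on xargs -0, which GNU and BSD xargs provide but POSIX does not require.

```shell
# Hypothetical helper: reads NUL-delimited paths on stdin and prints
# "<extension> <total lines>", one line per extension.
count_lines_by_ext() {
    xargs -0 sh -c '
        for path in "$@"; do
            [ -f "$path" ] || continue       # skip anything that is not a regular file
            base=${path##*/}                 # drop the directory part
            case $base in
                ?*.*) ext=${base##*.} ;;     # text after the last dot
                *)    ext="(no file extension)" ;;   # no dot, or a dotfile such as .gitignore
            esac
            # Emit "<extension><TAB><line count>" for this file.
            printf "%s\t%s\n" "$ext" "$(wc -l < "$path")"
        done
    ' sh | awk -F '\t' '{ sum[$1] += $2 } END { for (e in sum) print e, sum[e] }'
}

# Intended usage inside a repository (git ls-files -z separates paths with NUL):
#   git ls-files -z | count_lines_by_ext
```

This trades speed for robustness, since wc runs once per file. Treating dotfiles as having no extension is a design choice, and an extension that itself contains a tab or newline would still confuse the final awk stage.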

janos answered Oct 16 '22 22:10


Note: After reconsidering this, I really think janos' answer is the correct one for the question asked, since it provides the line count, whereas my solution provides the file count.


Using janos' solution gave me the following error (I'm using it on a fairly large project):

xargs: wc: Argument list too long

So I came up with the following solution (maybe not the most elegant, but it does the trick even on big projects):

git ls-files | awk -F . '{print $NF}' | sort | uniq -c | sort -n -r | awk '{print $2,$1}' | head -10

This basically consists of the following steps (which can be modified to suit your needs):

  • list all files known to git by git ls-files
  • use awk to get all the filetypes of the files
  • sort them
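  • collapse duplicates with uniq -c, which also counts the occurrences of each filetype
  • sort by count in descending order (filetype with the most occurrences on top)
  • swap the columns using awk so the filetype comes first (on input, $1 = count, $2 = filetype)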
  • strip the list to the top 10 using head
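
To make the difference from a line count concrete, the pipeline can be tried on a made-up file list instead of git ls-files output. This is a sketch under that assumption: the function name count_files_by_ext and the sample paths are hypothetical, and head -10 is dropped because the sample is tiny.

```shell
# Hypothetical wrapper around the pipeline above (without head -10).
# Reads one path per line on stdin and prints "<filetype> <number of files>".
count_files_by_ext() {
    awk -F . '{ print $NF }' |   # text after the last dot (whole path if no dot)
        sort |
        uniq -c |
        sort -n -r |
        awk '{ print $2, $1 }'
}

# Made-up file list standing in for `git ls-files`.
printf '%s\n' src/Main.java src/Util.java pom.xml README |
    count_files_by_ext
```

The first line of output should be java 2, because two paths end in .java, regardless of how many lines those files contain. README is listed under its own name because it has no dot, which is how this pipeline treats files without an extension.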
d4Rk answered Oct 17 '22 00:10