
How does github figure out a project's language?

I was recently working on a GitHub project written in both JavaScript and C++, and noticed that GitHub tagged the project as C++. If you have to pick a single language, this is probably the correct designation, since the C++ code is compiled as a JavaScript library. But this made me wonder: how does GitHub figure out which language to tag each project with?

Justin Ethier asked Mar 15 '11 21:03


People also ask

How does GitHub detect programming language?

GitHub uses the open source Linguist library to determine file languages for syntax highlighting and repository statistics. Language statistics will update after you push changes to your default branch.

How do I change the project language in GitHub?

You cannot set the repository language directly, but you can change how Linguist classifies files through attributes in a .gitattributes file. For example, if a project is 60% CSS and 40% JavaScript, you can tell github-linguist to ignore the CSS files so that the repository is reported as JavaScript.


1 Answer

Update April 2013, by nuclearsandwich (GitHub support team or "supportocat"):

  • the help page "My repository is marked as the wrong language" mentions that GitHub now uses the linguist library to determine file language for syntax highlighting and repo statistics. Linguist excludes certain file names and paths from the statistics, such as vendored files and directories.

  • the help page "Why isn't my favorite language recognized?" adds:

If your desired language is not receiving syntax highlighting you can contribute to the Linguist library to add it.


(Original answer, Oct. 2012)

This thread on GitHub support explains it:

It just sums up file sizes for each extension. Largest one "wins".

We'd like to avoid opening files up and parsing their content, as both would slow down the process... but that might be the only method of resolving conflicts like this one.
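The size-per-extension heuristic described in that quote can be sketched roughly as follows. This is only an illustration of the idea, not GitHub's actual code: the function name `guess_language_by_size` and the sample file list are hypothetical, and the real implementation may have differed in details.

```python
import os
from collections import defaultdict


def guess_language_by_size(files):
    """Return the file extension with the largest total size, or None.

    `files` is an iterable of (path, size_in_bytes) pairs. File contents
    are never opened, matching the "avoid parsing" goal in the quote.
    """
    totals = defaultdict(int)
    for path, size in files:
        ext = os.path.splitext(path)[1].lower()
        if ext:  # skip files without an extension
            totals[ext] += size
    if not totals:
        return None
    # "Largest one wins": pick the extension with the biggest byte total.
    return max(totals, key=totals.get)


# Hypothetical repository listing: mostly C++ with some JavaScript.
sample = [
    ("src/engine.cpp", 48000),
    ("src/engine.h", 6000),
    ("lib/binding.js", 20000),
    ("lib/util.js", 9000),
]
print(guess_language_by_size(sample))  # prints ".cpp"
```

The sketch also shows why the guess can be wrong: a single large file in a secondary language is enough to outweigh many small files in the main one.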

Since this is not 100% accurate, it has led some to add:

I, too, would vote for a simple manual-override switch for the cases where the guess is wrong.


Note: as Mark Rushakoff mentions in his answer (upvoted), the guessing has improved since then thanks to the linguist project (open-sourced in June 2011).
You can see there are still issues though: GitHub Linguist Issues.
See here for more details:

Once the language has been detected, it is passed to Albino, a Pygments wrapper, which does the actual syntax highlighting.

And you can add linguist directives in a .gitattributes file.
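As a rough illustration, a .gitattributes file using Linguist overrides might look like the fragment below. The attribute names come from the Linguist documentation; the paths and patterns are hypothetical, so check the current Linguist README before relying on the exact behavior.

```gitattributes
# Leave CSS files out of the language statistics
*.css -linguist-detectable

# Treat a bundled third-party directory as vendored (hypothetical path)
third_party/* linguist-vendored

# Mark generated output so it is not counted (hypothetical path)
build/*.js linguist-generated

# Count files with this extension as a specific language
*.h linguist-language=C++
```

The statistics are recalculated after the file is pushed to the default branch.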

VonC answered Sep 20 '22 02:09