I'm attempting to measure programming language popularity via:
Conveniently, there is a massive trove of Github data provided by Github Archive, and hosted by BigQuery. The only problem is that I don't see "language" available in any of the payloads for the various event types in Github Archive.
Here's the BigQuery query I've been running trying to find if, and where, language may be populated in the Github Archive data:
SELECT *
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT(payload, "$.repository.language") is null
LIMIT 100
Can someone please provide insight into whether I'll be able to utilize Github Archive data in this way, and how I can go about doing so? Or will I need to pursue some other approach? I see that there is also a github_repos public dataset on BigQuery, and it does have some language metrics, but the languages metrics seem to be over all time. I'd prefer to get some sort of monthly metric eventually (i.e., of "active" repos in a given month, what were the most popular languages).
Any advice is appreciated!
With BigQuery and GitHub Archive and GHTorrent -
To get the languages by pull requests, last December (copy pasted from http://mads-hartmann.com/2015/02/05/github-archive.html):
SELECT COUNT(*) c, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') lang
FROM [githubarchive:month.201612]
WHERE JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') IS NOT NULL
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10
To find the number of stars per project:
SELECT COUNT(*) c, repo.name
FROM [githubarchive:month.201612]
WHERE type='WatchEvent'
GROUP BY 2
ORDER BY 1 DESC
LIMIT 10
For a quick language vs bytes view, you can use GHTorrent:
SELECT language, SUM(bytes) bytes
FROM [ghtorrent-bq:ght.project_languages]
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
Or to look at the actual files, see the contents of GitHub on BigQuery.
Now you can mix these queries to get the results you want!
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With