Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GitHub Archive Google Big Query repositories language information for 2015

I have a problem with retrieving language information from GitHub Archive Google BigQuery since the structure of the tables changed which was at the beginning of 2015.

When querying github_timeline table I have a field named repository_language. It allows me to get my language statistics. Unfortunately for 2015 the structure has changed and the table doesn't contain any events after 2014. For example the following query doesn't return any data:

select
repository_language, repository_url, created_at
FROM [githubarchive:github.timeline]
where
PARSE_UTC_USEC(created_at) > PARSE_UTC_USEC('2015-01-02 00:00:00')

Events for 2015 are in: githubarchive:month & githubarchive:day tables. None of them have language information tho (or at least repository_language column).

Can anyone help me?

like image 523
Basia F Avatar asked Jan 28 '26 10:01

Basia F


2 Answers

Look at payload field
It is string that, I think, actually holds JSON with all "missing" attributes

You can process this using JSON Functions

enter image description here

Added Query

Try as below:

SELECT 
  JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
  COUNT(1) AS usage
FROM [githubarchive:month.201601] 
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC
like image 78
Mikhail Berlyant Avatar answered Jan 30 '26 22:01

Mikhail Berlyant


What Mikhail said + you can use a query like this:

SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') language, COUNT(*) c
FROM [githubarchive:month.201501]
GROUP BY 1
ORDER BY 2 DESC
like image 22
Felipe Hoffa Avatar answered Jan 31 '26 00:01

Felipe Hoffa