Microsoft logo
TU Delft logo

GHTorrent can be accessed over Google Cloud services. To access the data requires you to have a Google Cloud account. Reasonable use is free of charge and, in the case of BigQuery, it should no longer require a credit card. (Pub/Sub still requires a credit card). You can check what Google considers reasonable at any given moment here.

Both services can be accessed through the Web, the command line (after installing the Google Cloud command line utils) or though various programming languages.


With BigQuery, you can query GHTorrent's MySQL dataset using an SQL-like language (lately, BigQuery also supports vanilla SQL); more importantly, you can join the dataset with other open datasets (e.g. GitHub's own project data, Reddit, TravisTorrent etc) hosted on BigQuery.

To get the most popular programming languages by number of bytes written, run the following:

select pl3.lang, sum(pl3.size) as total_bytes
from (
  select pl2.bytes as size, pl2.language as lang
  from (
    select pl.language as lang, max(pl.created_at) as latest, pl.project_id as project_id
    from [ght.project_languages] pl
      join [ght.projects] p on p.id = pl.project_id
    where p.deleted is false
      and p.forked_from is null
    group by lang, project_id
  ) pl1 join [ght.project_languages] pl2 on pl1.project_id = pl2.project_id
                                        and pl1.latest = pl2.created_at
                                        and pl1.lang = pl2.language
) pl3
group by pl3.lang
order by total_bytes desc

To get the user with the most Java commits in the Netherlands in June 2016, do the following:

select u.login as login, u.location as location, count(c.id) as num_commits
from [ghtorrent-bq.ght.project_commits] pc join
     (SELECT id, author_id FROM [ghtorrent-bq.ght.commits] WHERE
     date(created_at) between date('2016-06-01')
                          and date('2016-07-01') )c on pc.commit_id = c.id join
     (SELECT id
     FROM [ghtorrent-bq.ght.projects] WHERE language = 'Java') p on p.id = pc.project_id join
     (SELECT login, location, id
     FROM [ghtorrent-bq.ght.users]
     WHERE country_code = 'nl') u on c.author_id = u.id,
group by login, location
order by num_commits desc;

See also some queries by Felipe Hoffa.


Pub/Sub allows subscribers to get events of what is happening on GitHub (or at least GHTorrent's interpretation of what is happening on GitHub) in almost real time. To do so, one needs to subscribe to one of the available topics with a client in order to start receiving events.

The service is complimentary, even though less fine-grained, to GHTorrent's own streaming interface. As is also the case with GHTorrent streaming, the contents of the streams are generated by following the live MongoDB server replication stream. See the code here.

To subscribe to a topic, e.g. commits, run the following:

gcloud beta pubsub subscriptions create my_commits_subscription --topic projects/ghtorrent-bq/topics/commits

To start receiving events, you can try the command line

gcloud beta pubsub subscriptions pull --auto-ack --max-messages 5 -- my_commits_subscription

The available topics are the following: