The GHTorrent project
Welcome to the GHTorrent project, an effort to create a scalable, queryable, offline mirror of data offered through the GitHub REST API.
What does GHTorrent do?
GHTorrent monitors the GitHub public event timeline. For each event, it exhaustively retrieves the event's contents and their dependencies. It then stores the raw JSON responses in a MongoDB database, while also extracting their structure into a MySQL database.
GHTorrent works in a distributed manner. A RabbitMQ message queue sits between the event mirroring and data retrieval phases, so that both can be run on a cluster of machines. Have a look at this presentation and read this paper if you want to know more. Here is the source code.
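The two-phase design described above can be sketched as follows. This is only an illustrative sketch and not GHTorrent's actual code: the standard-library queue.Queue stands in for RabbitMQ, a plain dict stands in for MongoDB, an in-memory SQLite database stands in for MySQL, and the table layout, function names and sample events are all hypothetical.

```python
import json
import queue
import sqlite3

# Hypothetical sample events; real ones would come from the GitHub public event timeline.
SAMPLE_EVENTS = [
    {"id": "1", "type": "PushEvent",
     "actor": {"login": "alice"}, "repo": {"name": "alice/project"}},
    {"id": "2", "type": "IssuesEvent",
     "actor": {"login": "bob"}, "repo": {"name": "alice/project"}},
]


def mirror_events(events, event_queue):
    """Event mirroring phase: publish every observed event to the queue."""
    for event in events:
        event_queue.put(json.dumps(event))


def retrieve_and_store(event_queue, raw_store, db):
    """Data retrieval phase: drain the queue, keep raw JSON, extract structure."""
    while True:
        try:
            message = event_queue.get_nowait()
        except queue.Empty:
            break
        event = json.loads(message)
        # Raw JSON response, keyed by event id (stand-in for MongoDB).
        raw_store[event["id"]] = message
        # Extracted structure (stand-in for MySQL; the schema is an assumption).
        db.execute(
            "INSERT OR IGNORE INTO events (id, type, actor, repo) VALUES (?, ?, ?, ?)",
            (event["id"], event["type"],
             event["actor"]["login"], event["repo"]["name"]),
        )
    db.commit()


def run_pipeline(events):
    """Wire both phases together through the queue and return both stores."""
    event_queue = queue.Queue()
    raw_store = {}
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events (id TEXT PRIMARY KEY, type TEXT, actor TEXT, repo TEXT)"
    )
    mirror_events(events, event_queue)
    retrieve_and_store(event_queue, raw_store, db)
    return raw_store, db
```

Because the two phases only share the queue, each could in principle run as a separate process on a different machine, which is the point of placing a message broker between them.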
The project releases the collected data as downloadable archives.
How much data do you have?
Currently (Jan 2015), MongoDB stores around 4TB of compressed JSON data, while MySQL holds more than 1.5 billion rows of extracted metadata. A large part of the activity of 2012, 2013, 2014 and 2015 has been retrieved, and we are also going backwards to retrieve the full recorded history of important projects.
How can I help?
GHTorrent needs contributions on the following fronts:
API keys: We can run multiple GHTorrent worker instances concurrently. To work around GitHub's API rate limit, we need multiple GitHub API keys provided by users. If you use GHTorrent for your research, please consider donating a key.
Linking and analysis: GHTorrent currently does only limited analysis and linking within the dataset (user geolocation). There are many possibilities for expansion. One could, for example, think of linking commits to issues.
Reporting bugs: Please use GitHub's issue tracker here to report any data consistency issues you find.
Donating: We are trying to make GHTorrent a self-sustainable operation. If you are using GHTorrent, please consider donating (you can find a donation button on the left). All individuals and companies that have donated will be listed on the Hall of Fame page.
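To illustrate the linking idea from the list above, here is a minimal sketch of how commits might be linked to issues by scanning commit messages for closing keywords followed by an issue number (for example "fixes #12"). It is not part of GHTorrent. The regular expression covers only a simplified subset of possible references, and the function names and the input format (a dict from commit SHA to message) are assumptions made for this example.

```python
import re

# Simplified assumption: an issue reference is a closing keyword
# (close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved)
# followed by whitespace and "#<number>".
ISSUE_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def linked_issues(commit_message):
    """Return the sorted, de-duplicated issue numbers referenced in a message."""
    return sorted({int(number) for number in ISSUE_REF.findall(commit_message)})


def link_commits_to_issues(commits):
    """Map each commit SHA to its referenced issues, skipping commits with none."""
    links = {}
    for sha, message in commits.items():
        issues = linked_issues(message)
        if issues:
            links[sha] = issues
    return links
```

A real linker would also need to handle references to issues in other repositories and check that each referenced number actually exists in the issue data.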
Why did you do it?
We are doing research on software repositories. GitHub is an exciting new data source for us, one that solves several of the problems we face as data miners. The uniformity of its data will allow research to scale to hundreds or thousands of repositories spanning multiple languages and application domains.
Why the name?
Initially, the project offered the data through the BitTorrent network (gh: from GitHub, torrent: from BitTorrent). As the data is currently offered only through HTTP, the name now signifies a torrent of data coming from GitHub.
Can I know more?
Have a look at the following presentation for a short introduction.
How can I cite this work?
If you find this dataset useful and want to use it in your work, please cite the following paper:
Georgios Gousios: The GHTorrent dataset and tool suite. MSR 2013: 233-236