MSR 2014 Mining Challenge Dataset
After the initial release of the dataset, users reported errors and missing features. The versions and their fixes are listed in the table below. Only the latest version is offered for download.
You are advised to always run queries against the newest version. If you have already downloaded an older version and the described fix does not affect your experiment, you can skip the update.
|Version||Release date||Fixed error|
|1.3||13 Dec 2013||Project members were missing for some projects.|
|1.2||22 Oct 2013||user_id in table commit_comments not set correctly.|
|1.1||9 Oct 2013||Table commit_comments was missing data. Some commits were missing from some projects.|
|1.0||28 Sep 2013||Initial release|
The MSR 2014 challenge dataset is a (very) trimmed down version of the original GHTorrent dataset. It includes data from the top-10 starred software projects for the top programming languages on Github, which gives 90 projects and their forks. For each project, we retrieved all data including issues, pull requests, organizations, followers, stars and labels (milestones and events are not included). The dataset was constructed from scratch to ensure that it contains the latest information.
Similarly to GHTorrent itself, the MSR challenge dataset comes in two flavours:
- A MongoDB database dump containing the results of querying the Github API. See format here.
- A MySQL database dump containing a queryable version of important fields extracted from the raw data. See schema here.
The included projects are the following:
akka/akka devtools/hadley ProjectTemplate/johnmyleswhite stat-cookbook/mavam hiphop-php/facebook knitr/yihui shiny/rstudio folly/facebook mongo/mongodb doom3.gpl/TTimo phantomjs/ariya TrinityCore/TrinityCore MaNGOS/mangos bitcoin/bitcoin mosh/keithw xbmc/xbmc http-parser/joyent beanstalkd/kr redis/antirez ccv/liuliu memcached/memcached openFrameworks/openframeworks libgit2/libgit2 redcarpet/vmg libuv/joyent SignalR/SignalR SparkleShare/hbons plupload/moxiecode mono/mono Nancy/NancyFx ServiceStack/ServiceStack AutoMapper/AutoMapper RestSharp/restsharp ravendb/ravendb MiniProfiler/SamSaffron storm/nathanmarz elasticsearch/elasticsearch ActionBarSherlock/JakeWharton facebook-android-sdk/facebook clojure/clojure CraftBukkit/Bukkit netty/netty android/github node/joyent jquery/jquery html5-boilerplate/h5bp impress.js/bartaz d3/mbostock chosen/harvesthq Font-Awesome/FortAwesome three.js/mrdoob foundation/zurb symfony/symfony CodeIgniter/EllisLab php-sdk/facebook zf2/zendframework cakephp/cakephp ThinkUp/ginatrapani phpunit/sebastianbergmann Slim/codeguy django/django tornado/facebook httpie/jkbr flask/mitsuhiko requests/kennethreitz symfony/xphere-forks reddit/reddit boto/boto django-debug-toolbar/django-debug-toolbar Sick-Beard/midgetspy django-cms/divio rails/rails homebrew/mxcl jekyll/mojombo gitlabhq/gitlabhq diaspora/diaspora devise/plataformatec blueprint-css/joshuaclayton octopress/imathis vinc.cc/vinc paperclip/thoughtbot compass/chriseppstein finagle/twitter kestrel/robey flockdb/twitter gizzard/twitter sbt/sbt scala/scala scalatra/scalatra zipkin/twitter
Importing and using
The following instructions assume an OS X or Linux-based host.
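As a rough illustration of what loading the two dumps can look like, the sketch below only builds the shell commands and prints them; it does not run anything. The file name `msr14-mysql.sql`, the directory `dump/msr14`, the database name `msr14` and the `root` user are all hypothetical placeholders, not names taken from the dataset. It assumes the standard `mysql` client and `mongorestore` tool are installed.

```python
import shlex


def build_import_commands(mysql_dump, mongo_dump_dir,
                          db_name="msr14", mysql_user="root"):
    """Return shell command strings for loading both dumps.

    Nothing is executed here. All argument values are placeholders
    (assumptions), not the actual file names of the dataset.
    """
    q = shlex.quote
    create_stmt = "CREATE DATABASE IF NOT EXISTS " + db_name
    return [
        # 1. Create an empty MySQL database to receive the dump.
        f"mysql -u {q(mysql_user)} -p -e {q(create_stmt)}",
        # 2. Feed the SQL dump to the mysql client.
        f"mysql -u {q(mysql_user)} -p {q(db_name)} < {q(mysql_dump)}",
        # 3. Restore the MongoDB dump directory into a database.
        f"mongorestore -d {q(db_name)} {q(mongo_dump_dir)}",
    ]


if __name__ == "__main__":
    for cmd in build_import_commands("msr14-mysql.sql", "dump/msr14"):
        print(cmd)
```

Printing the commands first makes it easy to check paths and credentials before pasting them into a terminal.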
Answers to frequently asked questions
Why a new dataset?
For practical reasons. The dataset is small enough to be used on a laptop, yet rich enough to do really interesting research with it.
What are the hardware requirements?
We have successfully imported both dumps into, and used them on, a 2011 MacBook Air with 4GB of RAM. Your mileage may vary, but relatively new systems with more than 4GB of RAM should have no trouble with both databases. If you only need the MySQL data dump, the hardware requirements are even lower.
Why two databases? Do I need both?
Not necessarily. The MySQL database can readily cover many aspects of activity on Github. Perhaps the only reason to use the MongoDB dump is to analyse commit contents, branches affected by pull requests or milestones, which are not included in MySQL.
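To give a feel for the kind of question the relational dump can answer, here is a minimal sketch that counts commit comments per user. It runs against an in-memory SQLite table as a stand-in for MySQL, so it works without the dataset. Only the table name `commit_comments` and its `user_id` column come from the changelog above; the `id` and `body` columns and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite stand-in; the real dataset is a MySQL dump.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commit_comments ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"  # id/body are assumed
)
conn.executemany(
    "INSERT INTO commit_comments (user_id, body) VALUES (?, ?)",
    [(1, "LGTM"), (2, "typo here"), (1, "needs a test")],  # made-up rows
)


def comments_per_user(connection):
    """Count commit comments per user, most active users first."""
    cursor = connection.execute(
        "SELECT user_id, COUNT(*) AS n FROM commit_comments "
        "GROUP BY user_id ORDER BY n DESC, user_id ASC"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for user_id, n in comments_per_user(conn):
        print(user_id, n)
```

The same `GROUP BY` query should carry over to the MySQL dump, provided the column names match the published schema.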
How can I ask a question about the dataset?
Your question and the potential answer might be useful for other people as well, so please use the form below. Please note that I will not answer questions sent to my email.