Sharing your analytic code in open source epidemiology
I've become involved in the open source movement and recently wrote an article in the journal Epidemiology about how we can advance epidemiology by releasing our analytic code. There are four main benefits to releasing code: 1) transparency, 2) reproducibility, 3) advancement of methods, and 4) education. As epidemiologists, we do not maintain a monopoly on our methods, and information should be free and accessible to all.
For those interested in following suit, I've prepared this brief tutorial. I'm using GitHub as the source code repository, and Zenodo as a way to make the work citable (through a digital object identifier, or DOI). I don't specifically endorse any open source platform, but these two have worked well for me and are respected in the community.
First, you'll need to obtain a free account on GitHub. Then, using your GitHub account, set up a corresponding Zenodo account, configured as documented in the "Login to Zenodo" section in this other blog post.
Before releasing your code, be sure to have redacted any proprietary or personal information, cleaned up your code according to "coding best practices," and provided the citation to the published article in the source code so your audience can learn more about your work. You may also be interested in applying an open source license. I recommend the simple MIT license for all code being released (unless your code inherits other code, in which case a GPL license is likely the best fit).
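As a rough illustration of that pre-release checklist, here is a minimal Python sketch that scans the text of a source file for three things: a citation (it simply looks for the word "doi"), a license notice, and hard-coded user directories that could leak a name. The function name, the dictionary keys, and the patterns are all my own assumptions for illustration; a check like this is only a crude aid and does not replace reading through your code before you publish it.

```python
import re

# Hypothetical patterns for illustration only; adapt them to your own project.
CITATION = re.compile(r"\bdoi\b", re.IGNORECASE)       # e.g. "doi:..." or "doi.org/..."
LICENSE = re.compile(r"\blicen[sc]e\b", re.IGNORECASE)  # e.g. "MIT License"
USER_PATH = re.compile(r"[A-Za-z]:\\Users\\|/home/\w+|/Users/\w+")  # personal folders


def release_checklist(source_text: str) -> dict:
    """Return a crude pass/fail summary for a single source file's text."""
    return {
        "cites_article": bool(CITATION.search(source_text)),
        "mentions_license": bool(LICENSE.search(source_text)),
        "no_user_paths": USER_PATH.search(source_text) is None,
    }
```

You could read each file you plan to release, pass its contents to release_checklist, and review anything that comes back False.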
Once you're ready to share your code, follow these five steps:
[EDIT: Nov 1 2017]
As I originally wrote this, I did not consider the complication of publishing your code during the peer review process of the corresponding manuscript. Now that I have been through that process, I have a few tips/recommendations to share. These are mainly applicable for those using GitHub to archive the code and Zenodo to share the code through a DOI.
When releasing your initial code, I would set it as a "pre-release" in GitHub. There is a simple checkbox when creating a release that flags it as such. You then include the release-specific DOI in the manuscript. Behind the scenes, Zenodo has actually created two DOIs when you create the initial release: one that is specific to the release, and another that is generic to the repository. The generic one will always link to the most recent release you have on file. This is potentially useful if you just wish to have the manuscript always link to the most recent code. However, it also creates a potential problem because, in theory, the manuscript should link to the code that was used for that iteration of the manuscript.
The alternative, and I think proper, solution is to create a release for each iteration of the manuscript/code. Suppose you get back the peer reviews of the initial submission and the decision is revise-and-resubmit. If, during this process, you needed to run additional analyses or have changed your code, you can create a new release on GitHub that will correspondingly (and automatically) generate a new DOI on Zenodo. The revised manuscript then includes this updated, release-specific DOI.
This solution works fine... until the manuscript is accepted for publication. Since you don't necessarily want to link to pre-release software in your manuscript, you have one final opportunity to update the DOI when you receive the page proofs. At this point, you create the final release on GitHub, uncheck "pre-release", and drop the new DOI into the page proofs for publication. In the simple example here, of an initial submission and a revised (and ultimately accepted) manuscript, there will be three releases and four DOIs, as follows:
Of course, you can avoid the burden of these additional steps by including the generic repository DOI in the original manuscript, but if you update the code at any point after the article goes in press, readers will be unsure of exactly what code was used for the manuscript analyses. And isn't that the point of releasing your code?
One final note on updating code and creating new releases on GitHub. If you are truly working in a collaborative environment, you would need to create a pull request to update the code. However, assuming you are the sole author of the analytic code, the original code can simply be updated by editing the current code in your repository. When you update the original code, you can commit the changes directly to the master branch (effectively bypassing the collaborative nature of GitHub). Previous versions of the code are still accessible to interested parties by clicking the History link. Creating a new release simply takes a snapshot of the current code and archives it for posterity.
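For those who prefer to script this rather than click through the website, the release sequence described above can be sketched in Python. This is only a sketch under stated assumptions: the tag names and titles are hypothetical, and the dictionaries use the field names (tag_name, name, body, prerelease) that GitHub's REST API accepts when creating a release. Nothing is sent to GitHub here; the code only builds the request bodies and counts the DOIs you should expect, following the one-per-release-plus-one-generic rule described earlier.

```python
# Hypothetical tags and titles, one entry per manuscript iteration.
MANUSCRIPT_RELEASES = [
    ("v0.1", "Initial submission", True),    # pre-release
    ("v0.2", "Revised submission", True),    # pre-release
    ("v1.0", "Accepted manuscript", False),  # final release at page proofs
]


def build_release_payload(tag: str, title: str, prerelease: bool) -> dict:
    """Build the body for a 'create a release' request (not sent anywhere here)."""
    return {
        "tag_name": tag,
        "name": title,
        "body": f"Analytic code for: {title}",
        "prerelease": prerelease,
    }


def plan_releases() -> list:
    """Return one payload per manuscript iteration, in order."""
    return [build_release_payload(*entry) for entry in MANUSCRIPT_RELEASES]


def expected_doi_count(n_releases: int) -> int:
    """One release-specific DOI per release, plus one generic repository DOI."""
    return n_releases + 1
```

Calling plan_releases() gives three payloads, only the last of which has prerelease set to False, and expected_doi_count(3) returns the four DOIs mentioned above.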