Hands on reproducible analysis of neuroimaging data: Nov. 2-3, UCSD

(Neuro)Debian/Git/GitAnnex/DataLad: Distributions and Version Control

Overview

Teaching: 15 min
Exercises: 30 min
Questions
  • What are the best ways to obtain and track information about software, code, and data used or produced in the study?

Objectives
  • Review how to use Git to obtain repositories locally, inspect history, and commit changes

  • Go over the main commands of the APT package manager for preparing computational environments

Introduction

The title of this section brings together a range of technologies which at first glance seem completely independent: GNU/Linux distributions (such as Debian), which provide computing environments, and version control systems (VCS) such as Git, which originate in software development. But distributions and version control systems have a feature in common: they provide means to obtain (in other words, to install) content and to manage it locally. Moreover, installed components typically carry an unambiguous specification of the installed version and often of its origin, that is, where it came from. It is this characteristic which makes them ideal vehicles for obtaining the components (code, software, data, etc.) necessary for your research, instead of downloading and installing them manually.

In this training section we will concentrate on learning only a few core commands for a number of popular technologies, which will help you to discover and obtain the components necessary for a research project. Moreover, we will present a few features of DataLad which will be used in subsequent lectures.

This “Distributions” Google Spreadsheet provides a somewhat simplified overview and an aligned comparison of the basic concepts and commands of Debian/Conda/PyPI/Git/git-annex/DataLad if we consider their “versioned distribution” functionality. Please consult that spreadsheet to complete the hands-on challenges below before peeking at the “full” answer.

More thorough coverage

If you are interested in learning more about VCS (and Git in particular) and about package managers/distributions, we encourage you to go through the following materials on your own at a later, convenient moment:

(Neuro)Debian

Debian

Debian is the largest community-driven open source project and one of the oldest Linux distributions. Its platform, package format (DEB), and package manager (APT) became very popular, especially after Debian was chosen as the base for many derivatives such as Ubuntu and Mint. At the moment Debian provides over 40,000 binary packages for virtually any field of endeavour, including many scientific applications. Any number of those packages can be installed very easily via the unified interface of the APT package manager, with clear information about versioning, licensing, etc. Interestingly, most Debian packages can now themselves be built reproducibly (see Debian: Reproducible Builds).

Because of this variety, the wide range of supported hardware, its acknowledged stability, and its adherence to the principles of open and free software, Debian is a very popular “base OS” for direct installation on hardware, in the cloud, or in containers (Docker or Singularity).

NeuroDebian

The NeuroDebian project was established to integrate software used for research in psychology and neuroimaging within the standard Debian distribution.

To facilitate access to the most recent versions of such software on existing releases of Debian and of its most popular derivative, Ubuntu, the NeuroDebian project established its own APT repository. In a way, this repository is similar to the Debian backports repository, but a) it also supports Ubuntu releases, b) backport builds are typically uploaded to NeuroDebian as soon as they are uploaded to Debian unstable, and c) it contains some packages which have not yet made it into Debian proper.

To enable NeuroDebian on your standard Debian or Ubuntu machine, you could apt-get install neurodebian (and follow the interactive dialogue) or just follow the instructions on http://neuro.debian.net.

Exercise: check NeuroDebian

Check if NeuroDebian is “enabled” in your VM Ubuntu installation

For those using older VM images for NeuroDebian, you might have to use apt-cache policy instead of apt policy
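
One possible way to check, sketched under the assumption that you are on a Debian or Ubuntu system: list the configured APT sources and count the entries that mention NeuroDebian. The variable name nd_entries is made up for this illustration, and apt-cache policy is used because it is also available on older images.

```shell
# Count APT source entries that mention NeuroDebian.
# 0 means the NeuroDebian repository is not enabled (or APT is not available).
if command -v apt-cache >/dev/null 2>&1; then
    nd_entries=$(apt-cache policy 2>/dev/null | grep -ci 'neurodebian' || true)
else
    nd_entries=0
fi
echo "APT source entries mentioning NeuroDebian: ${nd_entries}"
```

Running apt-cache policy on its own prints the full list of sources with their priorities, which is worth reading through once.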

Note: “God” privileges needed

Operations which modify the state of the system (so not just searching or showing) must be performed by the super user. It is therefore typical to have the sudo tool installed and to use it as a prefix to the command (e.g. sudo do-evil to run do-evil as super user).

Exercise: Search and Install

The goal is to search for and install application(s) to visualize neuroimaging data (using the terminal for the purpose of the exercise, although there are good GUIs as well)
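
A minimal sketch of the search step, assuming a Debian or Ubuntu system. The search terms and the package name mricron are only illustrative choices, not the one expected answer; multiple terms given to apt-cache search must all match.

```shell
# Searching needs no root privileges.
if command -v apt-cache >/dev/null 2>&1; then
    hits=$(apt-cache search nifti viewer 2>/dev/null || true)
else
    hits=""
fi
printf '%s\n' "${hits:-no matching packages found (or APT is not available)}"

# Installing modifies the system, so it needs sudo. "mricron" is an example pick;
# substitute a package name from the search results above.
# sudo apt-get install mricron
```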

Exercise: Multiple available versions

The goal of the exercise is to be able to install the desired version of a tool
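
As a sketch, again assuming a Debian or Ubuntu system: apt-cache can list every version of a package that the configured sources offer, and APT accepts PACKAGE=VERSION to install a particular one. The package name below is only an example, and VERSION is a placeholder to be copied from the listing.

```shell
# Example package; any package name works here.
PKG=connectome-workbench
if command -v apt-cache >/dev/null 2>&1; then
    apt-cache policy "$PKG" || true     # installed version, candidate, priorities
    apt-cache madison "$PKG" || true    # one line per available version and its source
fi

# Install a specific version (needs sudo); replace VERSION with one listed above.
# sudo apt-get install "${PKG}=VERSION"
```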

Git

We all probably do some level of version control of our files, documents, and even data files, but without a version control system (VCS) we do it in an ad-hoc manner:

A Story Told in File Names by Jorge Cham, http://www.phdcomics.com/comics/archive_print.php?comicid=1323

Unlike distributions (such as Debian, conda, etc.), where we as users only have the power to select among already existing versions of software, a VCS does not only provide access to existing versions of content, but also gives you the “super-power” to establish new versions by changing or adding content. VCSs also often facilitate sharing the derived works with a complete and annotated history of content changes.

Exercise – What is Git?

Exercise – tell Git about yourself!

Since Git makes a record of changes, please configure git to know your name and email (you could just as well use a fake email, but be consistent to simplify attribution).

Check the content of ~/.gitconfig which is the --global config for git.

Without --global, configuration changes would be stored in .git/config of a particular repository.
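
A minimal sketch of this configuration step. The name and email are placeholders. So that trying it out does not overwrite an existing ~/.gitconfig, the sketch points HOME at a scratch directory for the current shell session; leave that line out to configure your real account.

```shell
# Scratch HOME for the demo only (affects just this shell session).
export HOME="$(mktemp -d)"

# Placeholder identity: replace with your own name and email.
git config --global user.name "Jane Doe"
git config --global user.email "jane.doe@example.com"

# --global settings are stored in ~/.gitconfig.
cat "$HOME/.gitconfig"
git config --global --get user.name
```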

Hint: use git COMMAND --help

to obtain documentation specific to the COMMAND. Recall navigation shortcuts from the previous section. Similarly --help is available for datalad COMMANDs.

Exercise – install AKA clone

Clone https://github.com/repronim/sfn2018-training locally

Question: What is the “version” of the content you got?

git clone brings you the most recent content available in the “default branch” of the repository. So what “version” of content did we get?
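
One way to answer this is to ask Git which commit the fresh clone points to. Because cloning from GitHub needs network access, the sketch below clones a throwaway local repository instead (the paths and the demo identity are made up); for the exercise, pass the GitHub URL as the first argument of git clone.

```shell
# Scratch area with a tiny "upstream" repository standing in for the GitHub URL.
work=$(mktemp -d)
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q --allow-empty -m "Initial commit"

# Clone it, as you would clone the URL from the exercise.
git clone -q "$work/upstream" "$work/clone"

# The "version" of the content is the commit that HEAD points to.
version=$(git -C "$work/clone" rev-parse HEAD)
echo "HEAD is at commit $version"
git -C "$work/clone" log -1 --oneline
```

The commit identifier is derived from the content and its history, so two clones reporting the same identifier hold the same version.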

Git “philosophy” in 2 minutes

Exercise: Time travel through the full history of changes.
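
A small self-contained sketch of such time travel, using a scratch repository with two commits rather than the training repository. The file name, messages, and the helper g are invented for the illustration.

```shell
repo=$(mktemp -d)
# Helper: run git inside the scratch repository with a demo identity.
g() { git -C "$repo" -c user.name="Demo User" -c user.email="demo@example.com" "$@"; }

g init -q
echo "first draft" > "$repo/notes.txt"
g add notes.txt
g commit -q -m "Add notes"
echo "second draft" > "$repo/notes.txt"
g commit -q -a -m "Revise notes"

g log --oneline                            # history, newest commit first

first=$(g rev-list --max-parents=0 HEAD)   # the very first (root) commit
g checkout -q "$first"                     # travel back: working tree shows the old state
old=$(cat "$repo/notes.txt")
g checkout -q -                            # return to the branch we came from
new=$(cat "$repo/notes.txt")
echo "then: $old / now: $new"
```

The same git log and git checkout commands work in any clone, including the one obtained in the previous exercise.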

git-annex

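git-annex is a tool which allows you to manage data files within a git repository without committing the (large) content of those data files directly under git. In a nutshell, git-annex

  • moves the content of a data file under .git/annex/objects and leaves a symlink pointing to it in its place,

  • commits that symlink (not the content) under git,

  • records in the git-annex branch where (in which clones or at which URLs) the content is available,

so later on, if you have access to the clones of the repository which have a copy of the file, you could easily git annex get its content (which will download/copy that file under .git/annex/objects) or git annex drop it (which would remove that file from .git/annex/objects).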

As a result of git not containing the actual content of those large files, but instead containing just symlinks, and information within git-annex branch, it becomes possible to

We will have exercises working with git-annex repositories in the next section

DataLad

DataLad relies on git and git-annex to provide a platform which encapsulates many aspects of “distributions” and VCS for the management and distribution of code, data, and computational environments. Relying on the flexibility of git-annex to reference content from the web, datasets.datalad.org provides hundreds of datasets (git/git-annex repositories) which give access to over 12TB of neuroscience data from different projects (such as openfmri.org, crcns.org, etc.). And because all content is unambiguously versioned by git and git-annex, there is a guarantee that the content for the same version will be the same across all clones of the dataset, regardless of where the content was obtained from.

DataLad embraces version control and modularity (visit poster 2046 “YODA: YODA’s organigram on data analysis” for more information) to facilitate efficient and reproducible computation. With DataLad you can not only gain access to data resources and maintain your computational scripts under a version control system, but also maintain the full record of the computation you performed in your study. Let’s conclude this section with a very minimalistic neuroimaging study, which we perform while recording the full history of changes. Two sections ahead we will go through a more complete example.

Exercise: Install a dataset

Use datalad install command to install a sample dataset from http://datasets.datalad.org/?dir=/openfmri/ds000114 :
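
A hedged sketch of the install step. It relies on DataLad's "///" shorthand for the collection at datasets.datalad.org and needs network access; the variable name dataset is only for illustration.

```shell
# Path of the dataset within the datasets.datalad.org collection.
dataset=openfmri/ds000114

if command -v datalad >/dev/null 2>&1; then
    # Creates a ds000114 directory in the current working directory.
    datalad install "///$dataset" || echo "install failed (no network access?)"
else
    echo "datalad is not installed"
fi
```

Only the git history and the file names are fetched at this point; the content of large files is obtained on demand later.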

Exercise: Explore its history

Q1: What is its current version?

Q2: Did 1.0.0 version of the dataset follow BIDS?

Q3: What is the difference between 2.0.0 and 2.0.0+1 versions?

Task: Assuming that the dataset is also compliant with the released BIDS specification 1.0.2, fix the BIDSVersion field in dataset_description.json and datalad save the change with a descriptive message

Exercise: Explore and obtain a data file

Q: Look at sub-01/ses-test/anat/sub-01_ses-test_T1w.nii.gz. What is it? Does it have content?
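
A sketch of how one might look at the file and then obtain its content, assuming it is run from the top directory of the installed ds000114 dataset and that the file is annexed as a symlink (the usual case on Linux file systems).

```shell
f=sub-01/ses-test/anat/sub-01_ses-test_T1w.nii.gz

if [ -L "$f" ]; then
    ls -l "$f"            # shows a symlink pointing into .git/annex/objects
    [ -e "$f" ] && echo "content is present" || echo "content is not present yet"
    datalad get "$f"      # fetch the content from wherever it is available
    ls -lL "$f"           # size of the actual file behind the symlink
else
    echo "run this from the top directory of the installed ds000114 dataset"
fi
```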

Exercise: Perform basic analysis and make a run record

Use nib-ls from nibabel to get basic statistics on the file we just obtained and store them in an INFO.txt file in the top directory of the dataset. Once you have figured out the command to run, use datalad run to actually run it, so that it makes a record for the generated INFO.txt file.
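
One possible solution, sketched under the assumption that it is run from the top directory of the installed ds000114 dataset with datalad and nibabel available. The commit message is only an example; --input and --output tell datalad run which file to obtain beforehand and which file the command produces.

```shell
f=sub-01/ses-test/anat/sub-01_ses-test_T1w.nii.gz

if command -v datalad >/dev/null 2>&1 && [ -L "$f" ]; then
    # nib-ls -s prints basic statistics of the image data.
    datalad run -m "Record basic statistics of the T1w image" \
        --input "$f" --output INFO.txt \
        "nib-ls -s $f > INFO.txt"
    git log -1 --stat     # the new commit carries the run record and INFO.txt
else
    echo "run this from the top directory of the installed ds000114 dataset"
fi
```

Because the exact command is stored in the commit message, the result can later be regenerated from the recorded history instead of from memory.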

Key Points