Overview
Teaching: 15 min Exercises: 30 min
Questions
Can we represent the history of an entire analysis?
Can we use this history to repeat the analysis?
Objectives
Learn to capture analysis in structured form
Learn how to query the information
You can skip this lesson if you can answer these questions:
- Can you capture the details of all your analysis steps in a form that can be repeated?
- Do you know how to represent an analysis using the W3C PROV model?
This lesson provides an approach to capturing the details of an analysis in a structured form that can be reused to repeat and introspect the analysis.
Although not essential, it is helpful to have an understanding of:
“Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.” (source: W3C)
Most current publications cannot be used to assess the “quality, reliability or trustworthiness” of their results, because they do not capture every detail of the analysis steps, the data sources, and the metadata necessary to make such assessments. The goal of provenance capture is to make such assessments feasible and practical. Thus far we have focused on how to describe data, computational environments, and analysis scripts. This lesson extends that focus to relating these descriptions to one another in a form that captures the details of execution steps, thus moving from specification to provenance.
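To make the W3C definition concrete, here is a minimal sketch of how a single analysis step might be written down as entities, an activity, and an agent, and then queried. The layout is loosely modeled on the PROV-JSON serialization but has not been validated against it. The file names, the `ex:` prefix, the command string, and the helper names `make_prov_record` and `inputs_of` are all hypothetical and used only for illustration.

```python
import json
from datetime import datetime, timezone


def make_prov_record(input_file, output_file, command, user):
    """Describe one analysis step as PROV-style entities, an activity, and an agent."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        # "ex" is a made-up namespace for this sketch.
        "prefix": {"ex": "http://example.org/"},
        "entity": {
            f"ex:{input_file}": {"prov:type": "ex:InputData"},
            f"ex:{output_file}": {"prov:type": "ex:Result"},
        },
        "activity": {
            "ex:step1": {"prov:startTime": now, "ex:command": command},
        },
        "agent": {f"ex:{user}": {"prov:type": "prov:Person"}},
        # Relations that tie the entities, the activity, and the agent together.
        "used": {
            "_:u1": {"prov:activity": "ex:step1", "prov:entity": f"ex:{input_file}"},
        },
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": f"ex:{output_file}", "prov:activity": "ex:step1"},
        },
        "wasAssociatedWith": {
            "_:a1": {"prov:activity": "ex:step1", "prov:agent": f"ex:{user}"},
        },
    }


def inputs_of(record, output_id):
    """Query: which entities were used by the activity that generated output_id?"""
    activities = {
        g["prov:activity"]
        for g in record["wasGeneratedBy"].values()
        if g["prov:entity"] == output_id
    }
    return sorted(
        u["prov:entity"]
        for u in record["used"].values()
        if u["prov:activity"] in activities
    )


# Illustrative values only: a hypothetical brain-extraction step run by "alice".
record = make_prov_record(
    "raw.nii.gz", "brain.nii.gz", "bet raw.nii.gz brain.nii.gz", "alice"
)
print(json.dumps(record, indent=2))
print(inputs_of(record, "ex:brain.nii.gz"))
```

Because the relations are stored as data rather than prose, a question such as "what went into this result?" becomes a lookup instead of a reading exercise. A real analysis would chain many such records together.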
There are many ways in which one can record what steps are being carried out. A general neuroimaging analysis involves many connected steps, as outlined in our overview of this module. There are different tools that can capture pieces of these steps. As in most research, not every analysis attempt ends in results used for publication. Analysis often involves understanding the data from different viewpoints, using different routines to figure out what works well and what does not, and changing the questions that are being asked of the data. Thus capturing the provenance is a fluid process and will likely require mastery of different tools.
The Unix shell has several tools for capturing the history of commands and details of execution, such as script.
There are also more complex history-recording tools that export information in a structured form or deposit it into a database.
The previously mentioned tools capture history in different forms and therefore
support different aspects of reproducible research. Command-line tools like
script, strace, and asciinema capture information that a user must inspect to
determine which commands to re-execute. However, these tools do not capture the
environment or data files. ReproZip can capture the entire environment, data,
and software necessary for repeating a previous analysis, and therefore provides
more complete reproducibility. However, ReproZip only runs on Linux systems,
while some of the other tools can be used on any Unix-like system (e.g. macOS,
GNU/Linux).
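The gap described above, between recording a command and recording its environment and data, can be illustrated with a small sketch. It runs one command and stores the command line, timestamps, exit code, a little platform information, and checksums of any input files. This is not how ReproZip or any of the other tools work internally, and it captures far less than they do. The function names `sha256_of` and `run_and_record` and the record layout are assumptions made for this example.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(command, input_files=()):
    """Run a command (a list of arguments) and return a record of the run."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    ended = datetime.now(timezone.utc).isoformat()
    return {
        "command": list(command),
        "started": started,
        "ended": ended,
        "exit_code": result.returncode,
        "stdout": result.stdout,
        # A very small slice of the environment; real tools record much more.
        "environment": {
            "platform": platform.platform(),
            "python": sys.version.split()[0],
        },
        # Checksums let a later reader confirm they have the same input data.
        "inputs": {str(p): sha256_of(p) for p in input_files},
    }


# Example run: a trivial command executed with the current Python interpreter.
record = run_and_record([sys.executable, "-c", "print('hello')"])
print(json.dumps(record, indent=2))
```

Storing the record as JSON keeps it machine-readable, so it can be queried later or converted into a richer provenance representation.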
Key Points
An analysis can be captured in a form that allows it to be repeated
Understanding the points of human interaction and decision making is essential for reproducibility