Overview
Teaching: 10 min Exercises: 10 min
Questions
Why and how does using the command line/shell efficiently increase reproducibility of neuroimaging studies?
How can we ensure that our scripts do the right thing?
Objectives
Provide hints on efficient use of the collected shell history of commands
Explain how to make shell scripts more robust and less dangerous
Shell commonly refers to the UNIX shell environment, which in its
core function provides users with a CLI (command line interface) to
manipulate “environment variables” and to execute external
commands. Because desired actions are expressed as typed commands, it
becomes possible to script (program) sets of commands to be
(re-)executed, using constructs such as loops, functions, and
conditional statements. In contrast to graphical user interfaces
(GUIs), automation via scripting is a native feature of a CLI shell.
Unlike GUIs, which have lots of functionality exposed in menu items and
icons, the shell is truly a “black box”, which has a lot of powerful
features that you need to discover first to be able to efficiently use
it. Because manipulation of files is one of the main tasks to
accomplish in a shell, a shell usually either comes with common
commands (such as cp, mv, and rm) built-in or is accompanied by
an additional package (e.g., coreutils in Debian) providing those
helpful command line utilities.
More thorough coverage
In this training event we assume that you know shell basics and will not go through detailed presentation of various aspects that are relevant for making your work in a shell—and research activities in general—more reproducible. We refer you to our full version of the training materials on Unix shells, which covers additional topics such as differences between shells, the importance of environment variables, and unit testing. We encourage you to go through the following materials on your own at a later time:
Most commands come with easily accessible information about their purpose and command line options.
--help
Typically, commands accept a --help argument (or less commonly
-help, e.g. AFNI) and respond by printing a concise description of
the entire program and a list of its (common) command line options.
Exercise (warm up)
Run some commands you know (e.g. bash, cat) with --help.
The man command provides access to manpages available for many (if
not the majority) of the available commands. Manpages often provide a
very detailed description and consist of many pages of textual
documentation. A manpage is presented to you in a pager, a basic command
for viewing and navigating a text file. The pager to be used is specified
by the environment variable $PAGER; two common pagers are more and less.
Some useful less shortcuts include q to quit, / to search forward, n to jump to the next match, and Space or b to page forward or back.
Exercise: Navigate man for git
Question: What is the short description of the git command?
man -k searches through all available short descriptions and command
names.
Exercise: Find commands for working with “containers”
Solution
% man -k containers
If you don't use containers, try the following instead of the last command:
% man -k shell
Vi and Vim are closely related editors that are commonly used as the default editor on Unix and GNU/Linux systems. While they are powerful editors, they have a steep learning curve. A number of tutorials are available online. Here we will just teach you how to exit Vi/Vim if you end up in this unknown territory: press Esc, then type :q! followed by Enter to quit without saving (or :wq to save and quit).
Environment variables are not a feature of a shell per se. Every process on any operating system inherits some “environment variables” from its parent process. A shell just streamlines manipulation of those environments and also uses some of them directly to guide its own operation. Let’s review the most commonly used and manipulated environment variables. These variables are important because they determine which external commands and libraries you are using.
Exercise: what is the path in the environment right now?
Simply print the current path in the terminal:
% echo $PATH
Different paths are separated by : (on Windows, ;). The order
in which the shell looks for commands is determined by the order of
the paths in this list.
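A long PATH is easier to inspect with one directory per line. The following is a minimal sketch; the function name list_path_entries is made up for this illustration.

```shell
# Print each directory of a PATH-like string on its own line, in lookup order.
# "tr" replaces every ':' separator with a newline.
list_path_entries() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Show the search path of the current shell.
list_path_entries "$PATH"
```

The first directory printed is the first one searched, so a command found there shadows commands of the same name further down the list.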
Exercise: determine which program (full path to it) executes when you run git
To see which command will actually be run, use which COMMAND:
% which git
/usr/bin/git
What about the python command? Try which -a as well.
% which -a python
which -a is a neat way to see all the versions of a command you have available across paths in the environment, listed in the order of paths specified in the $PATH environment variable.
Do not mix up which with locate, which (if available) would just find a file with that word somewhere in the file name/path.
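Inside scripts, the same lookup can be done with the POSIX built-in command -v, which does not depend on which being installed. This is only a sketch; the name no-such-command-here is made up and assumed not to exist on your system.

```shell
# "command -v" prints the path the shell would execute for a command name.
resolved=$(command -v sh)
echo "sh resolves to: $resolved"

# A name that is not found on PATH produces no output and a non-zero
# exit status, which makes it easy to test for missing dependencies.
if ! command -v no-such-command-here >/dev/null 2>&1; then
    echo "no-such-command-here is not available"
fi
```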
Question: How can you add a path to where the shell looks for commands?
- So that a command at that location takes precedence over a command with the same name found elsewhere on PATH?
- So that a command at that location is run only if a command with the same name is not found elsewhere on PATH? (This is rarely needed.)
Solution
For a new path /a/b/c:
- PATH=/a/b/c:$PATH
- PATH=$PATH:/a/b/c
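The effect of the two forms can be checked with a small experiment. The sketch below assumes that temporary directories allow executing scripts; the command name hello and both directories are made up for the demonstration.

```shell
# Two throwaway directories, each holding a different "hello" script.
first=$(mktemp -d)
second=$(mktemp -d)
printf '%s\n' '#!/bin/sh' 'echo "hello from first"'  > "$first/hello"
printf '%s\n' '#!/bin/sh' 'echo "hello from second"' > "$second/hello"
chmod +x "$first/hello" "$second/hello"

# Pretend that $second is already on the search path.
base="$second:$PATH"

# Prepending $first makes its "hello" take precedence ...
prepended=$( PATH="$first:$base"; hello )
# ... while appending $first leaves the "hello" from $second in charge.
appended=$( PATH="$base:$first"; hello )

echo "prepended: $prepended"
echo "appended:  $appended"
rm -rf "$first" "$second"
```

Each PATH change happens inside a subshell, so the search path of your own session is left untouched.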
You can avoid typing full command names and paths by using your shell’s completion capabilities. The details and features of completion vary across shells, but most shells offer at least some form of completion. The rest of this section will assume a bash shell.
As an example, you could type mkd followed by TAB. The
text will be expanded to mkdir if that’s the only command on PATH
starting with those three letters. If those letters don’t uniquely
identify a candidate, the text will be expanded to the longest common
stem, and you can hit TAB again to see the remaining choices.
This isn’t very useful for a five letter name, but it can save you
from typing out unwieldy things like
gunzip sub-16_task-balloonanalogrisktask_run-03_bold.nii.gz.
Advanced completion
Some shells can complete more than just commands and paths, but they may require additional configuration to do so. With bash and a Debian-based system, installing the bash-completion package will add support for more advanced features, such as completing options for some common commands (e.g. mkdir -- then TAB will display a list of mkdir’s long options). And any program can provide its own set of bash-completion rules, which is especially valuable for complex command-line interfaces like git.
By default, a shell records the history of commands you have run. You
can access it using the history command. When you exit the shell,
those history lines are appended to a file (~/.bash_history by
default in a bash shell). This not only allows you to quickly recall
commands you have run recently, but can also provide you with a “lab
notebook” of the actions you have performed. Thus the shell history can
be an indispensable record of how a result was obtained.
Eternal history
Unfortunately, by default shell history is truncated to the last 1,000 commands, so you cannot use it as your “eternal lab notebook” without some tuning. Since it is a common problem, solutions exist. Please review available approaches:
- shell-chronicle
- tune up of PROMPT_COMMAND to record each command as soon as it finishes running
- adjustment of HISTSIZE and HISTCONTROL settings
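As a rough illustration of the last two approaches, the lines below could be added to ~/.bashrc. This is a sketch only: the numeric limits are arbitrary example values, and the settings are specific to bash.

```shell
# Candidate settings for ~/.bashrc (example values, adjust to taste).
HISTSIZE=100000          # commands kept in memory by the running shell
HISTFILESIZE=1000000     # lines kept in the history file
HISTCONTROL=ignoredups   # do not store a command identical to the previous one

# Append to the history file instead of overwriting it (bash only).
if [ -n "${BASH_VERSION:-}" ]; then
    shopt -s histappend
fi

# Write each command to the history file as soon as the prompt returns,
# so that it is not lost if the terminal is closed or killed.
PROMPT_COMMAND='history -a'
```

Note that assigning PROMPT_COMMAND this way replaces any value set earlier in your configuration.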
Some of the main keyboard shortcuts to navigate shell history are
Ctrl-p | Previous line in the history |
Ctrl-n | Next line in the history |
Ctrl-r | Bring up the next match backwards in shell history (a very useful one) |
You can hit Ctrl-r and start typing some portion of the command you remember running. Hitting Ctrl-r again will bring up the next match and so on. You will leave “search” mode as soon as you use some other command line navigation command (e.g. Ctrl-e).
If you have had enough of searching with Ctrl-r, you can simply press Ctrl-c to exit incremental search, while still leaving the last search term on the terminal.
Alt-. | Insert the last positional argument of the previous command. |
Hitting Alt-. again will bring up the last argument of the command before that, and so on. Mac users will have to use Esc-. instead.
History navigation exercise
Inspect the history of shell commands you have run so far:
- use history and uniq to find your most frequently used command
- experiment using Ctrl-r to find commands next to the most popular command
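One possible approach to the first item is sketched below. The helper name count_commands and the sample command lines are made up; in an interactive bash session you would feed it the output of history instead.

```shell
# Count how often each command name (the first word of a line) occurs in
# the lines read from standard input, most frequent first.
# "uniq -c" only merges adjacent duplicates, hence the "sort" before it.
count_commands() {
    awk '{print $1}' | sort | uniq -c | sort -rn
}

# Interactive use: strip the entry numbers that "history" prints first.
#   history | sed 's/^ *[0-9]* *//' | count_commands | head

# A small made-up sample so that the sketch also runs in a script.
printf '%s\n' 'git status' 'ls -l' 'git commit' 'cd data' 'git log' 'ls' |
    count_commands
```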
Question: What is a shebang?
A shebang is a line at the beginning of a file that specifies what program should be used to interpret the script. It starts with #! followed by the command. For example, if a file blah begins with the following:
#!/bin/bash
echo "Running this script using bash"
then running ./blah is analogous to calling /bin/bash ./blah. The string “#!” is read out loud as “hash-bang” and therefore is shortened to “shebang.”
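The equivalence can be tried out with a throwaway script. The sketch below uses /bin/sh rather than /bin/bash only because it is present on nearly every Unix-like system, and it assumes the temporary directory permits executing files.

```shell
# Write a two-line script to a temporary file (the file name is arbitrary).
script=$(mktemp)
printf '%s\n' '#!/bin/sh' 'echo "Running this script using sh"' > "$script"
chmod +x "$script"    # the executable bit is needed to run it directly

direct=$("$script")            # the system reads the shebang and starts /bin/sh
explicit=$(/bin/sh "$script")  # the same interpreter, named by hand

echo "direct:   $direct"
echo "explicit: $explicit"
rm -f "$script"
```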
By default, your shell script might proceed with execution even if some command within it fails. This might lead to very bad side effects, for example when a later command operates on the incomplete output of a step that failed.
That is why it is generally advisable to use set -e in scripts. This
instructs the shell to exit with a non-zero exit code as soon as some
command fails.
If you expect that some command might fail and it is OK, handle its failing execution explicitly, e.g. via
% command_ok_to_fail || echo "As expected command_ok_to_fail failed"
or just
% command_ok_to_fail || :
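The difference is easy to observe by running the same commands with and without set -e. In this sketch, false stands in for any failing command and the "step" messages are placeholders.

```shell
# Without set -e the script carries on after the failure.
without_e=$(sh -c 'echo "step 1"; false; echo "step 2"')

# With set -e the script stops at the failing command and reports failure.
status=0
with_e=$(sh -c 'set -e; echo "step 1"; false; echo "step 2"') || status=$?

# A failure that is handled explicitly with "|| :" does not stop the script.
tolerated=$(sh -c 'set -e; false || :; echo "step 2"')

printf 'without set -e:\n%s\n' "$without_e"
printf 'with set -e (exit status %s):\n%s\n' "$status" "$with_e"
printf 'with "|| :":\n%s\n' "$tolerated"
```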
By default POSIX shell and bash treat undefined variables as variables containing an empty string:
% echo ">$undefined<"
><
which could also lead to many undesired and non-reproducible side effects. For example,
% sudo rm -rf ${PREFIX}/
expands to sudo rm -rf / if the PREFIX variable was not defined for some reason.
The set -u option instructs the shell to fail if an undefined variable is
used.
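A harmless way to see the effect is to echo the dangerous command instead of running it. In this sketch UNSET_PREFIX_DEMO is a made-up variable name, and nothing is ever deleted.

```shell
# Without set -u the undefined variable silently expands to nothing.
lenient=$(sh -c 'unset UNSET_PREFIX_DEMO; echo "rm -rf ${UNSET_PREFIX_DEMO}/"')

# With set -u the shell aborts before the command line is even built.
status=0
strict=$(sh -c 'unset UNSET_PREFIX_DEMO; set -u; echo "rm -rf ${UNSET_PREFIX_DEMO}/"' 2>/dev/null) || status=$?

echo "without set -u the command would be: $lenient"
echo "with set -u: output '$strict', exit status $status"
```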
If you intend to use some variable that might be undefined, you can use
${var:-DEFAULT} or ${var:=DEFAULT} to provide an explicit default
value. Both the :- and := forms evaluate to the default value if
the variable is unset or empty; the difference is that the := variant
also assigns the default value back to the variable.
% : ${notyetdefined:=1}
% echo ${notyetdefined}
1
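The two forms can be compared side by side. This is a sketch in which scratch_dir is a made-up variable name and /tmp/scratch an arbitrary default.

```shell
unset scratch_dir

# ":-" substitutes the default but leaves the variable itself untouched.
first="${scratch_dir:-/tmp/scratch}"
after_dash="${scratch_dir:-still unset}"

# ":=" substitutes the default and also assigns it to the variable.
# The leading ":" is a no-op command that merely triggers the expansion.
: "${scratch_dir:=/tmp/scratch}"
after_assign="${scratch_dir:-still unset}"

echo "value from :-      : $first"
echo "variable after :-  : $after_dash"
echo "variable after :=  : $after_assign"
```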
set -eu
Just set both “fail early” modes for extra protection to make your scripts more deterministic and thus reproducible.
Do not copy/paste full paths in your script(s). Define a variable for
each “root directory” for a number of relevant paths, such as
studypath=/home/me/thestudy or datapath=/data/commonmess. Then use
relative paths in specifications, appending them to a “root directory”
path if needed, e.g. "$datapath/participants.tsv". This allows your
script to work across different machines and on other datasets that
conform to the same layout. Relative paths are also preferable when
defining the relationship between two components (e.g. datasets, as you
will see in later sections).
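Put together with the “fail early” options, the beginning of such a script might look like the sketch below. The environment variable names STUDYPATH and DATAPATH and the results/summary.tsv location are assumptions made for this illustration.

```shell
#!/bin/sh
set -eu

# Root directories are defined exactly once. The defaults are the example
# values from the text; the (hypothetical) environment variables STUDYPATH
# and DATAPATH allow overriding them on another machine.
studypath=${STUDYPATH:-/home/me/thestudy}
datapath=${DATAPATH:-/data/commonmess}

# Everything else is expressed relative to those roots.
participants="$datapath/participants.tsv"
results="$studypath/results/summary.tsv"

echo "reading participants from: $participants"
echo "writing results to:        $results"
```

Moving the study to a different machine then only requires changing the two root definitions, or setting the two environment variables, rather than editing every path in the script.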
Key Points
A command line shell is a powerful tool and learning additional ‘tricks’ can help make its use more efficient, less error-prone, and thus more reproducible
Shell scripting is the most accessible tool to automate execution of an arbitrary set of commands. This avoids manual retyping of the same commands and in turn avoids typos and erroneous analyses