Overview
Teaching: 150 min
Exercises: 30 min

Questions
How does using the command line/shell efficiently increase the reproducibility of neuroimaging studies?
How can we ensure that our scripts do the right thing?
Objectives
Understand basic differences among available shells and their use(s) in neuroimaging toolkits
Use a shell as a working medium
Explain the role of some of the most important environment variables
Provide hints on efficient use of the collected shell history of commands
Explain how to make shell scripts more robust and less dangerous
Introduce basics of runtime and unit testing
You can skip this lesson if you can answer these questions:
- What factors affect the execution of a given typed command in shell?
- How can you script the execution of a list of commands given user input arguments?
- How can you guarantee that your script was executed correctly and will not fail during execution?
- How can you use text editors to edit your current command line?
- How can you quickly recover a sequence of commands you ran in shell (e.g., an analysis from a year ago)?
Shell commonly refers to the UNIX shell environment, which in its
core function provides users with a CLI (command line interface) to
manipulate “environment variables” and to execute external
commands. Because desired actions are expressed as typed commands, it
is possible to script (program) sets of those commands to be
(re-)executed repetitively or conditionally. For example, it provides constructs
for loops, functions, and conditions. So, in contrast to GUIs (graphical
user interfaces), such automation via scripting is a native feature of
a CLI shell. Unlike GUI-integrated environments with lots of
functionality exposed in menu items and icons, shell is truly a “black
box”, with lots of powerful underlying features integral to efficient use.
Since manipulating files is one of the main tasks in a shell, a shell usually comes with common commands (such as cp, mv, etc.) built in or provided by an additional package (e.g., coreutils in Debian).
In this lesson, we first familiarize ourselves with basic (and at times advanced) features of a shell and shell scripting, then review a few key aspects related to reproducibility. We will examine best practices for controlling command execution: knowing which command actually was run, what conditions could have potentially affected its execution, inspecting the available history of the commands, and verifying that a script did not complete while hiding a failed interim execution.
External teaching materials
Before going through the rest of this lesson, you should learn the basics of shell usage and scripting. The following lesson provides a good overview of all basic concepts. Even if you are familiar with shell and shell scripting, please review the materials of the lesson and try to complete all exercises in it, especially if you do not know correct answers right away:
Additional materials
If you are interested in knowing more about the history and features of various shells, please review the materials under the following external links:
- “Teaching Notes” of the above “The Unix Shell” lesson – provides a number of hints and links to interesting related resources
- Wikipedia:Unix shell
Relevant books:
- Data Science at the Command Line – contains a list of command line tools useful for “data science”
How can you determine what shell you’re currently in?
% echo $SHELL
How do you change the current shell of your current session?
Executing the name of the shell starts it. For example:

% tcsh

would enter a new tcsh session. You can exit it and return to your previous shell by typing exit or just pressing Ctrl-d.
How do you change the login shell (the one you enter when you login) for your account?
% chsh
What is a shebang?
It is the first line in the script, which starts with #! and is followed by the command interpreting the script; e.g., if a file blah begins with the following:

#!/bin/bash
echo "Running this script using bash"

then running ./blah is analogous to calling /bin/bash ./blah. The string “#!” is said out loud as “hash-bang” and therefore is shortened to “shebang”.
Exercise: supplying options to a shebang
To help answer the question, determine which of the following shebangs would be correct and what their effect would be.
1. #!/bin/bash
2. #!/bin/bash -e
3. #!/bin/bash -ex
4. #!/bin/bash -e -x
Solution

A shebang can pass at most one option to the interpreter (on Linux, everything after the command is handed over as a single argument), so 1-3 are all correct, while 4 would fail. The -ex option combines -e, which instructs the shell to exit immediately if a command returns a non-zero exit code, and -x, which prints commands and their arguments as they are executed.
Environment variables are not a feature of a shell per se. Every process on any operating system inherits some “environment variables” from its parent process. A shell just streamlines manipulation of those variables and also uses some of them to guide its own operation. Let’s review the most commonly used environment variables. These variables are important because they determine which external commands and libraries you are using.
Whenever a command is run without providing the full path on the
filesystem, the shell consults the PATH
environment variable to determine
where to look for the command. You may have multiple
implementations or versions of the same command available at different
locations, which may be specified within the PATH variable (separated
with a colon). Although this is a very simple concept, it is a workhorse
for “overlay distributions” (such as conda). It is
also a workhorse for “overlay environments” such as virtualenv
in Python, or modules on servers.
It can also be a source of much confusion in cases where an unintended
command is run. This is why any tool which aims to capture the
state of the computational environment for later re-execution
needs to store the value of the PATH variable to guarantee that
given the same set of files, the same commands are executed. For example,
we may have two different versions of AFNI installed in different locations;
without specifying the path to a particular installation of AFNI, we may
unintentionally run a different version than intended and end up with different results.
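For example, a minimal sketch of pinning a specific installation by prepending its location to PATH (the /opt/afni-* paths here are hypothetical):

% export PATH=/opt/afni-18.2/bin:$PATH   # prepend the intended AFNI installation
% which afni
/opt/afni-18.2/bin/afni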
How can you determine the full path of a command?
To see which program will actually be used when you run a command, use the which command; e.g.:

$ which afni
/usr/bin/afni

Do not confuse this with the locate command, which (if available) would find files containing the specified word somewhere in the file name/path.
Beware of built-in (“builtin”) commands
Some commands might be implemented by a shell itself, and that implementation may differ from the one provided by another set of tools.

Note that which is not a builtin command in bash (but is in zsh), meaning that in bash you would not be able to “resolve” builtin commands such as pwd.

% pwd -h        # bash builtin
bash: pwd: -h: invalid option
pwd: usage: pwd [-LP]
% which pwd
/bin/pwd
% /bin/pwd -h   # provided by coreutils
/bin/pwd: invalid option -- 'h'
Try '/bin/pwd --help' for more information.
Exercise: add a new path where the shell will look for commands…
1. such that those commands take precedence over identically named commands available elsewhere on the PATH?
2. such that those commands are run only if not found elsewhere on the PATH? (rarely needed/used case)

Solution

For a new path /a/b/c:

1. Use PATH=/a/b/c:$PATH
2. Use PATH=$PATH:/a/b/c
Exercise: determine the environment variables used by a process
Since each process inherits and possibly changes environment variables, so that its child processes inherit them in turn, it can often be important to be able to inspect them. Given a PID of a currently running process (e.g., the $$ variable in a POSIX shell contains the PID of your active shell), how can you determine its environment variables?

Solution

- Look into the /proc/PID/environ file on Unix/Linux systems. The entries in that file are separated by the byte 0; use the tr command to separate them with newlines.
- ps e PID will list all environment variables along with their values.
- The e shortcut in htop will show the environment variables of the selected process.
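For example, to inspect the environment of your current shell via /proc (the output will differ on your system):

% tr '\0' '\n' < /proc/$$/environ | head -2
SHELL=/bin/bash
LANG=en_US.UTF-8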
Why is ${variable} preferable over $variable?

Use ${variable} to safely concatenate a variable with another string. For instance, if you had a variable filename that contains the value preciousfile, $filename_modified would refer to the value of the possibly undefined filename_modified variable; on the other hand, ${filename}_modified will produce the desired value of preciousfile_modified.
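A quick demonstration (assuming filename_modified is indeed undefined):

% filename=preciousfile
% echo "$filename_modified"

% echo "${filename}_modified"
preciousfile_modified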
To improve maintainability and to make distributions smaller, most
programs use dynamic linking to reuse common functions provided by
shared libraries. The particular list of dynamic libraries that an
executable needs is often stored without full paths as well. Thus,
ld.so
(e.g., /lib/ld-linux.so.2
on recent Debian systems), which
takes care of executing those binaries, needs to determine which
particular libraries to load. The same way the PATH
variable resolves
paths for the execution of commands, the LD_LIBRARY_PATH
environment variable resolves paths for loading dynamic
libraries. Unlike PATH
, however, ld.so
does assume a list of
default paths (e.g., /lib
, then /usr/lib
on Linux systems, as
defined in /etc/ld.so.conf
file(s)). Consequently, you may not even have it explicitly set in your environment!
How can you discover which library is used?
ldd EXEC and ldd LIBRARY list the libraries a given binary is linked against. For a linked library, a full path is provided if it was found using ld’s default paths or the LD_LIBRARY_PATH variable. For example:

% ldd /usr/lib/afni/bin/afni | head
    linux-vdso.so.1 (0x00007fffd41ca000)
    libXm.so.4 => /usr/lib/x86_64-linux-gnu/libXm.so.4 (0x00007fd9b2075000)
    libmri.so => /usr/lib/afni/lib/libmri.so (0x00007fd9b1410000)
    ...
Swiss army knife to inspect execution on Linux systems
strace traces “system calls” – the calls your program makes to the core of the operating system (i.e., kernel). This way you can discover what files any given program tries to access or open for writing, which other commands it tries to run, etc. Try running
strace -e open COMMAND for some command of your choice.
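For example, a sketch of such a trace (on recent Linux systems many programs use the openat system call instead of open, so both are traced here; exact output will vary):

% strace -e trace=open,openat cat /etc/hostname > /dev/null
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
+++ exited with 0 +++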
Possible conflicts
It is possible for
PATH
to point to one environment whileLD_LIBRARY_PATH
points to libraries from another environment, which can cause either incorrect or hard-to-diagnose behavior later on. In general, you should avoid manually manipulating these two variables.
The idea of controlling path resolution via environment variables
also applies to language-specific domains. For example, Python consults
the PYTHONPATH
variable to determine search paths for Python
modules.
Possible side-effect
Having a mix of system-wide and user-specific installed applications/modules with custom installations in virtualenv environments can cause unexpected use of modules.
You can use
python -c 'import sys; print(sys.path)'
to output a list of paths your current default Python process will consult to find Python libraries.
Exercise: “exported” vs. “local” variables
Variables can be “exported” so they will be inherited by any new child process (e.g., when you run a new command in a shell). Otherwise, the variable will be “local”, and will not be inherited by child processes.
1. How can you determine if a variable was exported or not?
2. How do you produce a list of all local variables (present in your shell but not exported)?

Solution

1. Only exported variables will be output by the export command. Alternatively, you can use declare -p to list all variables prepended with their specific attributes:

% LOCAL_VARIABLE="just for now"
% export EXPORTED_VARIABLE="long live the king"
% declare -p | grep \_VARIABLE
declare -x EXPORTED_VARIABLE="long live the king"
declare -- LOCAL_VARIABLE="just for now"

2. Extrapolate from 1.: declare -p | grep -e '^declare --'
A shell can be used quite efficiently once you become familiar with its features and configure it to simplify common operations.
Aliases are shortcuts for commonly used commands, and can also add default options to calls of the most common commands. Please review the useful aliases presented in 30 Handy Bash Shell Aliases For Linux / Unix / Mac OS X.
Should aliases defined in your ~/.bashrc be used in your scripts?

No. Since ~/.bashrc is read only for interactive sessions, aliases placed there will not be available in your scripts’ environment. Even if they were available after some manipulation, it would be highly inadvisable to use them, since that would render your scripts not portable across machines/users.
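If you need such a shortcut inside a script, define a shell function in the script itself instead; a minimal sketch:

#!/bin/bash
# a function behaves like an alias, but works in non-interactive scripts
lsl() {
    ls -lA "$@"
}
lsl /etc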
bash
and other shells use the readline
library for basic navigation
and manipulation of the command line entry. That library provides two
major modes of operation which are inspired by
two philosophically different editors
– emacs
and vim
.
Use set -o emacs to enter emacs mode (the default) and set -o vi to enter vi mode. Subsequent discussion and examples refer to the default, emacs mode. Learning navigation shortcuts can increase your efficiency with the shell tenfold, so let’s review the most common ones for editing the command line text:
Ctrl-a | Go to the beginning of the line you are currently typing on
Ctrl-e | Go to the end of the line you are currently typing on
Ctrl-l | Clear the screen (similar to the clear command)
Ctrl-u | Remove text on the line before the cursor position
Ctrl-h | Remove preceding symbol (same as backspace)
Ctrl-w | Delete the word before the cursor
Ctrl-k | Remove text on the line after the cursor position
Ctrl-t | Swap the last two characters before the cursor
Alt-t | Swap the last two words before the cursor
Alt-f | Move cursor forward one word on the current line
Alt-b | Move cursor backward one word on the current line
Tab | Auto-complete files, folders, and command names
Hints:

- If an Alt- combination does not work in your terminal, you can temporarily work around that by hitting the Esc key once, instead of holding Alt, before pressing the following command character.
- Using the Ctrl- shortcuts is more efficient since it doesn’t require you to move your hands away from the main alphanumeric portion of the keyboard.
- You will be using the Ctrl key more often than CapsLock (which was originally used to assist with FORTRAN and other languages where all keywords had to be CAPITALIZED). You can change your environment settings to either swap them or to make CapsLock into another Ctrl key.

If you need a more powerful way to edit your current command line, use
Ctrl-x Ctrl-e (or Alt-e in zsh) | Edit command line text in the editor (as defined by the VISUAL environment variable)
Some shortcuts can not only edit command line text, but also control the execution of processes:
Ctrl-c | Kill currently running process
Ctrl-d | Exit current shell
Ctrl-z | Suspend currently running process; fg restores it, and bg places it into background execution
Interrogating shell options with set -o

Shells provide a set of configurable options which can be enabled or disabled using the set command. Use set -o to print the current settings of your shell, and then navigate man bash to find their extended description.

When using man, you can search the manual page by using the shortcut / and typing o option-name; type n for the next match (and N for the previous one) to identify the corresponding section. For example, set -o noclobber forbids overwriting of previously existing files, while the >| redirect operator can still be used to explicitly instruct the overwriting of an already existing file. “A shell redirect ate my results file” should no longer be given as a valid excuse.
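A quick demonstration of noclobber in action:

% set -o noclobber
% echo "results" > out.txt
% echo "oops" > out.txt
bash: out.txt: cannot overwrite existing file
% echo "on purpose" >| out.txt   # explicit overwrite still succeeds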
By default, a shell stores in memory a history of the commands you
have run. You can access this log using the history
command. When you exit
the shell, those history lines are appended to a file (by default in
~/.bash_history
for bash shell). This not
only allows you to quickly recall commands you have run recently, but
can effectively provide a “lab notebook” of the actions you have
performed. The shell history can be very useful for two reasons. First, it can provide
a skeleton for your script and help you realize that automating
your shell commands is worth the effort. Second, it helps you determine exactly
which command you ran to perform any given operation.
Eternal history
Unfortunately, by default shell history is truncated to the last 1000 commands, so you cannot use it as your “eternal lab notebook” without some tuning. Since this is a common problem, solutions exist, so please review the available approaches:

- shell-chronicle
- tune-up of PROMPT_COMMAND to record each command as soon as it finishes running
- adjustment of HISTSIZE and HISTCONTROL settings, e.g. 1 or 2
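As one example, a minimal sketch of such an adjustment for your ~/.bashrc (the exact values are illustrative):

# keep effectively unlimited history (bash >= 4.3)
HISTSIZE=-1
HISTFILESIZE=-1
# append to the history file instead of overwriting it on exit
shopt -s histappend
# record a timestamp along with each command
HISTTIMEFORMAT='%F %T '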
Some of the main keyboard shortcuts to navigate shell history are:
Ctrl-p | Previous line in the history
Ctrl-n | Next line in the history
Ctrl-r | Bring up next match backwards in shell history
You can hit Ctrl-r
(“reverse-i-search”) and start typing some portion of the command you
remember running. Subsequent use of Ctrl-r
will bring up the next match, and so
on. You will leave “search” mode as soon as you use some other
command line navigation command (e.g., Ctrl-e
).
Alt-. | Insert the final argument of the previous command
Subsequent use of Alt-.
will bring up the last argument of the previous command,
and so on.
History navigation exercise
Inspect the shell command history you have accumulated so far:

- Use the history and uniq commands to figure out which command you run the most.
- Experiment with Ctrl-r to find the commands next to the most popular command.
By default, your shell script will keep executing even if some command within it fails, which might lead to some very bad side effects (e.g., a script that keeps copying or removing files after a failed cd). This is why it’s generally advisable to use set -e in scripts to instruct the shell to exit with a non-0 exit code as soon as any command fails.
Note on special commands
POSIX defines some commands as “special”, such that their failure would cause the entire script to exit, even without set -e, if they returned a non-0 exit code: break, :, ., continue, eval, exec, exit, export, readonly, return, set, shift, trap, and unset.
If you expect some command to fail and that’s okay, handle its failing execution explicitly; e.g., via:
% command_ok_to_fail || echo "As expected command_ok_to_fail failed"
or just
% command_ok_to_fail || :
By default, POSIX shell and bash treat undefined variables as variables containing an empty string:

% echo ">$undefined<"
><

which can lead to many undesired and non-reproducible side effects; e.g., consider sudo rm -rf ${PREFIX}/ if the PREFIX variable was not defined for some reason (do not copy this into your terminal!).

The set -u
option instructs the shell to fail if an undefined variable is
used.
If you intend to use some variable that might still be undefined
you can either use ${var:-DEFAULT}
to provide an explicit DEFAULT
value or define it on the condition that it doesn’t already exist; e.g.:
% : ${notyetdefined:=1}
% echo ${notyetdefined}
1
set -eu
Include
set -eu
toward the beginning of your shell script. This command sets both “fail early” modes for extra protection to make your scripts more deterministic and thus reproducible.
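A minimal sketch of a script header following this advice:

#!/bin/bash
set -eu   # exit on any command failure (-e) or use of an undefined variable (-u)
echo "Processing subject: ${1}"   # with -u, fails early if no argument was given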
To some degree you can consider the set -u
feature to be a “run-time
test” – i.e., “test if variable is defined, and if not, fail”. In fact, bash
and other shells provide a command called test
, which
can perform various basic checks and return with a non-0 exit code if the
condition is not satisfied. For undefined variables, use test -v
:
% test -v undefined
% echo $?
1
See the “CONDITIONAL EXPRESSIONS” section of the man bash
page for more
conditions, such as:
-a file | True if file exists |
-w file | True if file exists and is writable |
-z string | True if the length of string is zero (use -n to test for non-zero length) |
Instead of calling the test
command, you can use
[ TEST-EXPRESSION ]
syntax, so test -v undefined
is identical to
[ -v undefined ]
.
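For example, a small sketch combining such a test with a conditional (outdir is just an illustrative name):

outdir=results
if [ ! -w "$outdir" ]; then
    echo "Cannot write to $outdir" >&2
    exit 1
fi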
With set -e
the whole operation of your script can be stated to be
somewhat tested – the script will fail as soon as any command fails.
Using such tests/assertions in your code can help guarantee that
your script performs as expected.
Exercise: TODO, under construction.
Unit-testing is a powerful paradigm to verify that pieces of your code (units) operate correctly in various scenarios, and that these assumptions are represented in the code. An interesting observation is that everyone does at least some “testing” by simply running their code/script on an input and checking that the output matches their expectations. Unit-testing just takes this workflow one step further: code such tests in a separate file so you can run them all at once later on (e.g., whenever you change your script) to verify that your script still performs correctly. In the simplest case, you can just copy your test commands into a separate script that would fail if any command within it fails (therefore effectively testing your target script(s)).
For example, the following script could be used to test basic correct operations
of AFNI’s 1dsum
command:
#!/bin/bash
set -e                       # fail the test script as soon as any command fails
tfile=$(mktemp)              # create a temporary random file name
printf "1\n1.5\n" >| $tfile  # populate file with known data
result=$(1dsum $tfile)       # compute result
[ "$result" = "2.5" ]        # compare result with target value
rm $tfile                    # cleanup
Although it looks trivially simple, this is a powerful basic test
to guarantee that 1dsum
is available, that it is installed
correctly, and that it operates correctly on
typical files stored on the file system.
To have better management over a collection of such tests, testing frameworks were developed for shell scripts. Notable ones are:

- shunit2
- bats

In general, they provide helpers and the means to execute tests, reporting which ones passed and which failed as the collection runs.
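To give a flavor of the syntax, here is a minimal sketch of a bats test for the bc calculator (assuming bats is installed; save it as test_bc.bats and run bats test_bc.bats):

#!/usr/bin/env bats

@test "bc adds two decimal numbers" {
    result=$(echo "1 + 1.5" | bc)
    [ "$result" = "2.5" ]
}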
Exercise: use existing testing framework to evaluate a script

Choose shunit2 or bats (or both) and

1. Re-write the above test for 1dsum using one of the frameworks. If you do not have AFNI available, you can instead test the generic bc or dc command line calculators that may be available on your system.
2. Add additional tests to “document” the behavior of 1dsum whenever
   - the input file is empty
   - multiple files are provided
   - some values are negative
Testing frameworks
Although we’ve focused on testing shell scripts, testing frameworks exist for nearly every programming and scripting language/environment (see Wikipedia: List of unit testing frameworks). We recommend extending this testing practice to the code you write at all stages of your analysis pipeline.
Key Points
There are a number of incompatible shells; different neuroimaging tools may use specific shells and thus provide instructions that are not compatible with your current shell.
A command line shell is a powerful tool and learning additional ‘tricks’ can help make its use much more efficient, less error-prone, and thus more reproducible.
Shell scripting is the most accessible tool to automate the execution of an arbitrary set of commands; this avoids manual retyping of the same commands and in turn avoids typos and erroneous analyses.
Environment variables play a big role in defining script behavior.
You can write automated tests for your commands to ensure correct execution.
Shell scripts are powerful, but – if misused – can cause big problems.