Overview
Teaching: 150 min
Exercises: 30 min

Questions
How does using the command line/shell efficiently increase the reproducibility of neuroimaging studies?
How can we ensure that our scripts do the right thing?
Objectives
Understand basic differences among available shells and their use(s) in neuroimaging toolkits
Use a shell as a working medium
Explain the role of some of the most important environment variables
Provide hints on efficient use of the collected shell history of commands
Explain how to make shell scripts more robust and less dangerous
Introduce basics of runtime and unit testing
You can skip this lesson if you can answer these questions:
- What factors affect the execution of a given typed command in shell?
- How can you script the execution of a list of commands given user input arguments?
- How can you guarantee that your script was executed correctly and will not fail during execution?
- How can you use text editors to edit your current command line?
- How can you quickly recover a sequence of commands you ran in shell (e.g., an analysis from a year ago)?
Shell commonly refers to the UNIX shell environment, which in its
core function provides users with a CLI (command line interface) to
manipulate “environment variables” and to execute external
commands. Because desired actions are expressed as typed commands, it
is possible to script (program) sets of those commands to be
(re-)executed repetitively or conditionally. For example, it provides constructs
for loops, functions, and conditions. So, in contrast to GUIs (graphical
user interfaces), such automation via scripting is a native feature of
a CLI shell. Unlike GUI-integrated environments with lots of
functionality exposed in menu items and icons, shell is truly a “black
box”, with lots of powerful underlying features integral to efficient use.
Since manipulating files is one of the main tasks in a shell, a shell usually comes with common commands (such as cp, mv, etc.) built in or provided by an additional package (e.g., coreutils in Debian).
In this lesson, we first familiarize ourselves with basic (and at times advanced) features of a shell and shell scripting, then review a few key aspects related to reproducibility. We will examine best practices for controlling command execution: knowing which command actually was run, what conditions could have potentially affected its execution, inspecting the available history of the commands, and verifying that a script did not complete while hiding a failed interim execution.
External teaching materials
Before going through the rest of this lesson, you should learn the basics of shell usage and scripting. The following lesson provides a good overview of all basic concepts. Even if you are familiar with shell and shell scripting, please review the materials of the lesson and try to complete all exercises in it, especially if you do not know correct answers right away:
Additional materials
If you are interested in knowing more about the history and features of various shells, please review the materials under the following external links:
- “Teaching Notes” of the above “The Unix Shell” lesson – provides a number of hints and links to interesting related resources
- Wikipedia:Unix shell
Relevant books:
- Data Science at the Command Line – contains a list of command line tools useful for “data science”
How can you determine what shell you’re currently in?
% echo $SHELL
How do you change the current shell of your current session?
Executing the name of the shell starts it. For example:

% tcsh

would enter a new tcsh session. You can exit it and return to your previous shell by typing exit or just pressing Ctrl-d.
How do you change the login shell (the one you enter when you login) for your account?
% chsh
What is a shebang?
It is the first line in the script, which starts with #! and is followed by the command interpreting the script; e.g., if a file blah begins with the following:

#!/bin/bash
echo "Running this script using bash"

then running ./blah is analogous to calling /bin/bash ./blah. The string “#!” is said out loud as “hash-bang” and therefore is shortened to “shebang”.
Exercise: supplying options to a shebang
To help answer the question, determine which of the following shebangs would be correct and what their effect would be.
1. #!/bin/bash
2. #!/bin/bash -e
3. #!/bin/bash -ex
4. #!/bin/bash -e -x
Solution

A shebang can pass at most one option to the interpreter (on Linux, everything after the command is handed over as a single argument), so 1-3 are all correct, while 4 would fail. The -ex option combines -e, which instructs the shell to exit immediately if a command returns a non-zero exit code, and -x, which prints commands and their arguments as they are executed.
Environment variables are not a feature of a shell per se. Every process on any operating system inherits some “environment variables” from its parent process. A shell just streamlines manipulation of those variables and also uses some of them to guide its own operation. Let’s review the most commonly used environment variables. These variables are important because they determine which external commands and libraries you are using.
Whenever a command is run without providing the full path on the
filesystem, the shell consults the PATH
environment variable to determine
where to look for the command. You may have multiple
implementations or versions of the same command available at different
locations, which may be specified within the PATH variable (separated
with a colon). Although this is a very simple concept, it is a workhorse
for “overlay distributions” (such as conda). It is
also a workhorse for “overlay environments” such as virtualenv
in Python, or modules on servers.
It can also be a source of much confusion in cases where an unintended
command is run. This is why any tool which aims to capture the
state of the computational environment for later re-execution
needs to store the value of the PATH variable to guarantee that
given the same set of files, the same commands are executed. For example,
we may have two different versions of AFNI installed in different locations;
without specifying the path to a particular installation of AFNI, we may
unintentionally run a different version than intended and end up with different results.
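For example, a minimal sketch of pinning a specific installation by prepending its location to PATH (the /opt/afni-* paths here are hypothetical):

% export PATH=/opt/afni-18.2/bin:$PATH   # prepend the intended AFNI installation
% which afni
/opt/afni-18.2/bin/afni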
How can you determine the full path of a command?
To see which program will actually be used when you run a command, use the which command; e.g.:

$ which afni
/usr/bin/afni

Do not confuse this with the locate command, which (if available) would find files containing the specified word somewhere in the file name/path.
Beware of built-in (“builtin”) commands
Some commands might be implemented by a shell itself, and that implementation may differ from the one provided by another set of tools.

Note that which is not a builtin command in bash (but is in zsh), meaning that in bash you would not be able to “resolve” builtin commands such as pwd.

% pwd -h        # bash builtin
bash: pwd: -h: invalid option
pwd: usage: pwd [-LP]
% which pwd
/bin/pwd
% /bin/pwd -h   # provided by coreutils
/bin/pwd: invalid option -- 'h'
Try '/bin/pwd --help' for more information.
Exercise: add a new path where the shell will look for commands…
1. such that those commands take precedence over identically named commands available elsewhere on the PATH?
2. such that those commands are run only if not found elsewhere on the PATH? (rarely needed/used case)

Solution

For a new path /a/b/c:

1. Use PATH=/a/b/c:$PATH
2. Use PATH=$PATH:/a/b/c
Exercise: determine the environment variables used by a process
Since each process inherits and possibly changes environment variables, so that its child processes inherit them in turn, it can often be important to be able to inspect them. Given a PID of a currently running process (e.g., the $$ variable in a POSIX shell contains the PID of your active shell), how can you determine its environment variables?

Solution

- Look into the /proc/PID/environ file on Unix/Linux systems. The entries in that file are separated by the byte 0; use the tr command to separate them with newlines.
- ps e PID will list all environment variables along with their values.
- The e shortcut in htop will show the environment variables of the selected process.
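For example, to inspect the environment of your current shell via /proc (the output will differ on your system):

% tr '\0' '\n' < /proc/$$/environ | head -2
SHELL=/bin/bash
LANG=en_US.UTF-8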
Why is ${variable} preferable over $variable?

Use ${variable} to safely concatenate a variable with another string. For instance, if you had a variable filename that contains the value preciousfile, $filename_modified would refer to the value of the possibly undefined filename_modified variable; on the other hand, ${filename}_modified will produce the desired value of preciousfile_modified.
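A quick demonstration (assuming filename_modified is indeed undefined):

% filename=preciousfile
% echo "$filename_modified"

% echo "${filename}_modified"
preciousfile_modified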
To improve maintainability and to make distributions smaller, most
programs use dynamic linking to reuse common functions provided by
shared libraries. The particular list of dynamic libraries that an
executable needs is often stored without full paths as well. Thus,
ld.so
(e.g., /lib/ld-linux.so.2
on recent Debian systems), which
takes care of executing those binaries, needs to determine which
particular libraries to load. The same way the PATH
variable resolves
paths for the execution of commands, the LD_LIBRARY_PATH
environment variable resolves paths for loading dynamic
libraries. Unlike PATH
, however, ld.so
does assume a list of
default paths (e.g., /lib
, then /usr/lib
on Linux systems, as
defined in /etc/ld.so.conf
file(s)). Consequently, you may not even have it explicitly set in your environment!
How can you discover which library is used?
ldd EXEC and ldd LIBRARY list the libraries a given binary is linked against. For a linked library, a full path is provided if it was found using ld’s default paths or the LD_LIBRARY_PATH variable. For example:

% ldd /usr/lib/afni/bin/afni | head
    linux-vdso.so.1 (0x00007fffd41ca000)
    libXm.so.4 => /usr/lib/x86_64-linux-gnu/libXm.so.4 (0x00007fd9b2075000)
    libmri.so => /usr/lib/afni/lib/libmri.so (0x00007fd9b1410000)
    ...
Swiss army knife to inspect execution on Linux systems
strace traces “system calls” – the calls your program makes to the core of the operating system (i.e., kernel). This way you can discover what files any given program tries to access or open for writing, which other commands it tries to run, etc. Try running
strace -e open COMMAND for some command of your choice.
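For example, a sketch of such a trace (on recent Linux systems many programs use the openat system call instead of open, so both are traced here; exact output will vary):

% strace -e trace=open,openat cat /etc/hostname > /dev/null
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/hostname", O_RDONLY) = 3
+++ exited with 0 +++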
Possible conflicts
It is possible for
PATH
to point to one environment whileLD_LIBRARY_PATH
points to libraries from another environment, which can cause either incorrect or hard-to-diagnose behavior later on. In general, you should avoid manually manipulating these two variables.
The idea of controlling path resolution via environment variables
also applies to language-specific domains. For example, Python consults
the PYTHONPATH
variable to determine search paths for Python
modules.
Possible side-effect
Having a mix of system-wide and user-specific installed applications/modules with custom installations in virtualenv environments can cause unexpected use of modules.
You can use
python -c 'import sys; print(sys.path)'
to output a list of paths your current default Python process will consult to find Python libraries.
Exercise: “exported” vs. “local” variables
Variables can be “exported” so they will be inherited by any new child process (e.g., when you run a new command in a shell). Otherwise, the variable will be “local”, and will not be inherited by child processes.
1. How can you determine if a variable was exported or not?
2. How do you produce a list of all local variables (present in your shell but not exported)?

Solution

1. Only exported variables will be output by the export command. Alternatively, you can use declare -p to list all variables prepended with their specific attributes:

% LOCAL_VARIABLE="just for now"
% export EXPORTED_VARIABLE="long live the king"
% declare -p | grep \_VARIABLE
declare -x EXPORTED_VARIABLE="long live the king"
declare -- LOCAL_VARIABLE="just for now"

2. Extrapolate from 1.: declare -p | grep -e '^declare --'
A shell can be used quite efficiently once you become familiar with its features and configure it to simplify common operations.
Aliases are shortcuts for commonly used commands, and can also add default options to calls of the most common commands. Please review the useful aliases presented in 30 Handy Bash Shell Aliases For Linux / Unix / Mac OS X.
Should aliases defined in your ~/.bashrc be used in your scripts?

No. Since ~/.bashrc is read only for interactive sessions, aliases placed there will not be available in your scripts’ environment. Even if they were available after some manipulation, it would be highly inadvisable to use them, since that would render your scripts not portable across machines/users.
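If you need such a shortcut inside a script, define a shell function in the script itself instead; a minimal sketch:

#!/bin/bash
# a function behaves like an alias, but works in non-interactive scripts
lsl() {
    ls -lA "$@"
}
lsl /etc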
bash
and other shells use the readline
library for basic navigation
and manipulation of the command line entry. That library provides two
major modes of operation which are inspired by
two philosophically different editors
– emacs
and vim
.
Use set -o emacs to enter emacs mode (the default) and set -o vi to enter vi mode. Subsequent discussion and examples refer to the default, emacs mode. Learning navigation shortcuts can increase your efficiency with the shell tenfold, so let’s review the most common ones for editing the command line text:
Ctrl-a | Go to the beginning of the line you are currently typing on
Ctrl-e | Go to the end of the line you are currently typing on
Ctrl-l | Clear the screen (similar to the clear command)
Ctrl-u | Remove text on the line before the cursor position
Ctrl-h | Remove preceding symbol (same as backspace)
Ctrl-w | Delete the word before the cursor
Ctrl-k | Remove text on the line after the cursor position
Ctrl-t | Swap the last two characters before the cursor
Alt-t | Swap the last two words before the cursor
Alt-f | Move cursor forward one word on the current line
Alt-b | Move cursor backward one word on the current line
Tab | Auto-complete files, folders, and command names
Hints:

- If an Alt- combination does not work in your terminal, you can temporarily work around that by hitting the Esc key once, instead of holding Alt, before pressing the following command character.
- Using the Ctrl- shortcuts is more efficient since it doesn’t require you to move your hands away from the main alphanumeric portion of the keyboard.
- You will be using the Ctrl key more often than CapsLock (which was originally used to assist with FORTRAN and other languages where all keywords had to be CAPITALIZED). You can change your environment settings to either swap them or to make CapsLock into another Ctrl key.

If you need a more powerful way to edit your current command line, use
Ctrl-x Ctrl-e (or Alt-e in zsh) | Edit command line text in the editor (as defined by the VISUAL environment variable)
Some shortcuts can not only edit command line text, but also control the execution of processes:
Ctrl-c | Kill currently running process
Ctrl-d | Exit current shell
Ctrl-z | Suspend currently running process; fg restores it, and bg places it into background execution
Interrogating shell options with set -o

Shells provide a set of configurable options which can be enabled or disabled using the set command. Use set -o to print the current settings of your shell, and then navigate man bash to find their extended description.

When using man, you can search the manual page by using the shortcut / and typing o option-name; type n for the next match (and N for the previous one) to identify the corresponding section. For example, set -o noclobber forbids overwriting of previously existing files, while the >| redirect operator can still be used to explicitly instruct the overwriting of an already existing file. “A shell redirect ate my results file” should no longer be given as a valid excuse.
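A quick demonstration of noclobber in action:

% set -o noclobber
% echo "results" > out.txt
% echo "oops" > out.txt
bash: out.txt: cannot overwrite existing file
% echo "on purpose" >| out.txt   # explicit overwrite still succeeds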
By default, a shell stores in memory a history of the commands you
have run. You can access this log using the history
command. When you exit
the shell, those history lines are appended to a file (by default in
~/.bash_history
for bash shell). This not
only allows you to quickly recall commands you have run recently, but
can effectively provide a “lab notebook” of the actions you have
performed. The shell history can be very useful for two reasons. First, it can provide
a skeleton for your script and help you realize that automating
your shell commands is worth the effort. Second, it helps you determine exactly
which command you ran to perform any given operation.
Eternal history
Unfortunately, by default shell history is truncated to the last 1000 commands, so you cannot use it as your “eternal lab notebook” without some tuning. Since this is a common problem, solutions exist, so please review the available approaches:

- shell-chronicle
- tune-up of PROMPT_COMMAND to record each command as soon as it finishes running
- adjustment of HISTSIZE and HISTCONTROL settings, e.g. 1 or 2
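As one example, a minimal sketch of such an adjustment for your ~/.bashrc (the exact values are illustrative):

# keep effectively unlimited history (bash >= 4.3)
HISTSIZE=-1
HISTFILESIZE=-1
# append to the history file instead of overwriting it on exit
shopt -s histappend
# record a timestamp along with each command
HISTTIMEFORMAT='%F %T '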
Some of the main keyboard shortcuts to navigate shell history are:
Ctrl-p | Previous line in the history
Ctrl-n | Next line in the history
Ctrl-r | Bring up next match backwards in shell history
You can hit Ctrl-r
(“reverse-i-search”) and start typing some portion of the command you
remember running. Subsequent use of Ctrl-r
will bring up the next match, and so
on. You will leave “search” mode as soon as you use some other
command line navigation command (e.g., Ctrl-e
).
Alt-. | Insert the final argument of the previous command
Subsequent use of Alt-.
will bring up the last argument of the previous command,
and so on.
History navigation exercise
Inspect the shell command history you have accumulated so far:

- Use the history and uniq commands to figure out which command you run the most.
- Experiment with Ctrl-r to find the commands next to the most popular command.
By default, your shell script will keep executing even if some command within it fails, which might lead to some very bad side effects (e.g., a script that keeps copying or removing files after a failed cd). This is why it’s generally advisable to use set -e in scripts to instruct the shell to exit with a non-0 exit code as soon as any command fails.
Note on special commands
POSIX defines some commands as “special”, such that their failure would cause the entire script to exit, even without set -e, if they returned a non-0 exit code: break, :, ., continue, eval, exec, exit, export, readonly, return, set, shift, trap, and unset.
If you expect some command to fail and that’s okay, handle its failing execution explicitly; e.g., via:
% command_ok_to_fail || echo "As expected command_ok_to_fail failed"
or just
% command_ok_to_fail || :
By default, POSIX shell and bash treat undefined variables as variables containing an empty string:

% echo ">$undefined<"
><

which can lead to many undesired and non-reproducible side effects; e.g., consider sudo rm -rf ${PREFIX}/ if the PREFIX variable was not defined for some reason (do not copy this into your terminal!).

The set -u
option instructs the shell to fail if an undefined variable is
used.
If you intend to use some variable that might still be undefined
you can either use ${var:-DEFAULT}
to provide an explicit DEFAULT
value or define it on the condition that it doesn’t already exist; e.g.:
% : ${notyetdefined:=1}
% echo ${notyetdefined}
1
set -eu
Include
set -eu
toward the beginning of your shell script. This command sets both “fail early” modes for extra protection to make your scripts more deterministic and thus reproducible.
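A minimal sketch of a script header following this advice:

#!/bin/bash
set -eu   # exit on any command failure (-e) or use of an undefined variable (-u)
echo "Processing subject: ${1}"   # with -u, fails early if no argument was given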
To some degree you can consider the set -u
feature to be a “run-time
test” – i.e., “test if variable is defined, and if not, fail”. In fact, bash
and other shells provide a command called test
, which
can perform various basic checks and return with a non-0 exit code if the
condition is not satisfied. For undefined variables, use test -v
:
% test -v undefined
% echo $?
1
See the “CONDITIONAL EXPRESSIONS” section of the man bash
page for more
conditions, such as:
-a file | True if file exists |
-w file | True if file exists and is writable |
-z string | True if the length of string is zero (use -n to test for non-zero length) |
Instead of calling the test
command, you can use
[ TEST-EXPRESSION ]
syntax, so test -v undefined
is identical to
[ -v undefined ]
.
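For example, a small sketch combining such a test with a conditional (outdir is just an illustrative name):

outdir=results
if [ ! -w "$outdir" ]; then
    echo "Cannot write to $outdir" >&2
    exit 1
fi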
With set -e
the whole operation of your script can be stated to be
somewhat tested – the script will fail as soon as any command fails.
Using such tests/assertions in your code can help guarantee that
your script performs as expected.
Exercise: TODO, under construction.
Unit-testing is a powerful paradigm to verify that pieces of your code (units) operate correctly in various scenarios, and that these assumptions are represented in the code. An interesting observation is that everyone does at least some “testing” by simply running their code/script on an input and checking that the output matches their expectations. Unit-testing just takes this workflow one step further: code such tests in a separate file so you can run them all at once later on (e.g., whenever you change your script) to verify that your script still performs correctly. In the simplest case, you can just copy your test commands into a separate script that would fail if any command within it fails (therefore effectively testing your target script(s)).
For example, the following script could be used to test basic correct operations
of AFNI’s 1dsum
command:
#!/bin/bash
set -e                       # fail the test script as soon as any command fails
tfile=$(mktemp)              # create a temporary random file name
printf "1\n1.5\n" >| $tfile  # populate file with known data
result=$(1dsum $tfile)       # compute result
[ "$result" = "2.5" ]        # compare result with target value
rm $tfile                    # cleanup
Although it looks trivially simple, this is a powerful basic test
to guarantee that 1dsum
is available, that it is installed
correctly, and that it operates correctly on
typical files stored on the file system.
To have better management over a collection of such tests, testing frameworks were developed for shell scripts. Notable ones are:

- shunit2
- bats

In general, they provide helpers and the means to execute tests, reporting which ones passed and which failed as the collection runs.
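To give a flavor of the syntax, here is a minimal sketch of a bats test for the bc calculator (assuming bats is installed; save it as test_bc.bats and run bats test_bc.bats):

#!/usr/bin/env bats

@test "bc adds two decimal numbers" {
    result=$(echo "1 + 1.5" | bc)
    [ "$result" = "2.5" ]
}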
Exercise: use existing testing framework to evaluate a script

Choose shunit2 or bats (or both) and

1. Re-write the above test for 1dsum using one of the frameworks. If you do not have AFNI available, you can instead test the generic bc or dc command line calculators that may be available on your system.
2. Add additional tests to “document” the behavior of 1dsum whenever
   - the input file is empty
   - multiple files are provided
   - some values are negative
Testing frameworks
Although we’ve focused on testing shell scripts, testing frameworks exist for nearly every programming and scripting language/environment (see Wikipedia: List of unit testing frameworks). We recommend extending this testing practice to the code you write at all stages of your analysis pipeline.
Key Points
There are a number of incompatible shells; different neuroimaging tools may use specific shells and thus provide instructions that are not compatible with your current shell.
A command line shell is a powerful tool and learning additional ‘tricks’ can help make its use much more efficient, less error-prone, and thus more reproducible.
Shell scripting is the most accessible tool to automate the execution of an arbitrary set of commands; this avoids manual retyping of the same commands and in turn avoids typos and erroneous analyses.
Environment variables play a big role in defining script behavior.
You can write automated tests for your commands to ensure correct execution.
Shell scripts are powerful, but – if misused – can cause big problems.