Using bash scripts to automate #AVpres workflows

I’m having lots of Ubuntu fun in my new role as Digital Preservation Data Specialist at the British Film Institute’s (BFI) National Archive. Every day is spent working with Linux workstations, both physical and virtual, running Ubuntu v18.04 and v19.10. I’m pleased to have an opportunity to focus my command line and scripting skills, while also working to test and improve RAWcooked with such an extensive and diverse film collection. As part of my new role I’m editing, writing and running bash scripts to automate mass RAWcooked preservation of 3PB worth of DPX sequences. Using bash scripts with cron scheduling is a new experience for me, so it feels like another excellent opportunity to share my steps and help someone else along.

If you’re like me then you are probably fairly new to this type of hand crafted preservation workflow, or microservice architecture. You might want to check out an earlier blog post I wrote about DIY Python microservices in audiovisual preservation workflows. In particular, Chapter 2: Microservices, features an overview and link to a wonderful document by Dave Rice and Annie Schweikert about microservice architecture using the OAIS model. Similarly, if you’re a complete RAWcooked or command line newbie it might benefit you to look at my other blog post DPX preservation workflow with RAWcooked which gives a short introduction to command line interfaces (CLI) such as Terminal, and installation guide to many of the tools I’ll be talking about in the following sections.

Please note all code examples given in this article are for bash scripts executed in a Unix/Linux Terminal, but may be transferable to other command line interfaces with a little tweaking. They are distributed under the MIT License – see page 4 for more details. Please test in a safe environment while you’re learning and don’t hesitate to drop me a message with any feedback or questions. And as it’s been a year since I wrote my first ‘For the love of FOSS’ post I’m revisiting my Daft Punk motivational tracks, and this one always makes me think of automation.

Pages

  1. Harder: Introduction to shell scripts
    – Strings and Integers
    – Variables
    – Functions
    – Loops
    – Pipelines
    – The Shebang
    – Further reading
  2. Better: Writing scripts
    – Your first script
    – My MediaConch script
    – Writing your own
  3. Faster: Automating your scripts
    – Netdata
    – GNU Parallel
    – Cron scheduling
    – Flock
    – Cronic
  4. Stronger: Sharing scripts
    – The BFI RAWcooked scripts
    – AVpres bash script repositories
    – Thanks and links
    – MIT License information

1. Harder: Introduction to shell scripts

Shell is a basic programme that takes user friendly commands and translates them into the underlying language of the computer operating system’s kernel. A terminal window presents a simple front end that we users can interact with. For users of Unix-based macOS and open source Linux, this is most likely presented in the form of a Terminal window running bash, or Bourne-Again SHell, named after shell developer Stephen Bourne of Bell Labs. Even Windows 10 will now let you download and run Linux bash. It’s not just somewhere to issue commands; it’s an environment for managing every aspect of your operating system: the storage, memory allocation, and much more.

Terminal command lines running RAWcooked

A script combines multiple shell commands into a list which the script reads through line by line and executes, allowing for the swift automation of multiple processes.  Every command you’ve used in Terminal is a command you can run within a script, and any programme you’ve used from command line, such as RAWcooked or FFmpeg, can be called in the script too.

I’ve found it much easier to get up and running with bash scripts than I have learning Python. I wouldn’t say this is because bash is easier, it’s just that bash feels more natural having experienced operating FFmpeg from the command line and using basic bash principles like for loops. Also there are some excellent debugging tools for bash and amazing websites that really educate and explain mistakes so clearly.
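As a quick illustration of that debugging support, bash’s built-in trace mode prints each command before it runs, which makes it easy to see exactly where a script goes wrong. A minimal sketch:

```shell
#!/bin/bash
# set -x switches on trace mode: each command is printed to
# stderr (prefixed with +) before it executes; set +x switches
# tracing off again for the rest of the script.
set -x
greeting="hello"
echo "$greeting world"
set +x
```

You can also run a whole script in trace mode without editing it, with `bash -x yourscript.sh`, and the static analyser ShellCheck will flag many common scripting mistakes before you run anything at all.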

Strings and Integers

A string is a character value containing letters, numbers or symbols, and strings are represented in ‘single’ or “double” quotes. Integers are numerical values that represent positive or negative whole numbers like 3 or -54, and are not represented in quotes. An integer can be converted into a string, but a string cannot be converted into an integer. Shell doesn’t handle fractions – also known as floats – natively, but you can use programmes like bc for this: it’s a calculator programming language you can incorporate into your bash scripts.

When writing scripts it’s important to be able to differentiate between strings and integers as they are handled differently. The table below demonstrates how string and numeric (integer or binary) comparisons differ:

Description                Numeric Comparison           String Comparison
less than                  -lt                          <
greater than               -gt                          >
equal                      -eq                          =
not equal                  -ne                          !=
less or equal              -le                          N/A
greater or equal           -ge                          N/A
Shell comparison example:  [ 100 -eq 50 ]; echo $?      [ "GNU" = "UNIX" ]; echo $?
Example borrowed from Bash scripting tutorial for beginners, Linuxconfig.org
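Those operators can be put to work in a simple if statement. This short sketch compares one integer and one string using the syntax from the table above:

```shell
count=100
# Numeric comparison: -gt tests 'greater than' on integers.
if [ "$count" -gt 50 ]; then
  echo "numeric: $count is greater than 50"
fi

name="GNU"
# String comparison: != tests that two strings differ.
if [ "$name" != "UNIX" ]; then
  echo "string: $name is not UNIX"
fi
```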

Variables

variable_directory="path_to_files/directory1/directory2"

This is one of the first terms you encounter when writing any scripts. Variables are used to store information to be referenced and edited in the script; for example, a variable can be a path to a regularly called directory (as above) or a list of media files you want to batch re-encode. It is helpful to think of variables as little boxes that hold information, and they retain this data in memory for use repeatedly throughout the script’s processes. Shell uses two types of variables: shell variables, which operate locally within shell’s resources, and environmental variables, which operate globally across the operating system and with external programmes. When a variable is called it is always called with the $ symbol, and when used as part of a path it will probably be formatted like this ${variable}, using expanded parameters. The curly braces {} are used when a variable is followed by characters which are not part of the variable name. You can allocate any alphanumerical label to a variable, as long as it starts with a letter or an underscore _ – and a descriptive name makes it easier to remember what’s in the box!
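A short sketch showing both plain and curly-brace expansion of the variable above (the _backup suffix is just an invented example):

```shell
variable_directory="path_to_files/directory1/directory2"

# Plain expansion: the variable name is followed by a space,
# so $variable_directory is unambiguous.
echo "Working in $variable_directory"

# Curly-brace expansion: without the braces, shell would look
# for a variable called variable_directory_backup instead.
echo "Copy log: ${variable_directory}_backup/copy.log"
```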

Functions

Another concept to introduce is functions, which allow you to generate one command for something you might repeat frequently within the script. In the BFI’s case this could be exporting information to a log file, which we do a lot. So when BFI Digipres Department Head, Stephen McConnachie, introduced a function called ‘log’ to his latest script I was keen to understand how it worked. I hope to implement it into future scripts! You can write a short function to run within a script, or you can create a library of functions that you call from external shell scripts. A function can be used to change the state of variables, exit to end a shell script, return a supplied value to a calling section of script before ending the function, and echo an output to stdout – read more about stdin, stdout and stderr here. A short example of a function follows; the commands could be echo statements to a log or FFmpeg commands.

function good_name () {
  command1
  command2
}

Loops

‘For’ loops iterate through a list, running the loop body for each item until the task completes with the last item. I’ve found some instances when this hasn’t happened due to unknown limitations within the shell script, so it’s always wise to test your loops before launching into AVpres workflows. Hopefully you’ll have experience of writing loops in Terminal and using the wildcard symbol * (asterisk), or escape characters like \ when writing terminal commands. Below are two loop examples I’ve used regularly. They both call on a Python script to execute copying directories or files to LTO tape.

for dir in */; do python copyit.py -l "$dir" path_to_LTO/; done

This for loop copies a collection of directories using Python script copyit.py to a new location on LTO tape. You can read more about copyit.py from the IFIscripts here.

find directory/ -name "*.mkv" -exec python copyit.py -l {} path_to_LTO/ \;

‘Find’ commands aren’t really classed as loops, I don’t think, but they can recursively move through lists in a similar way to for loops. In this example the find search is isolating any files with the extension .mkv within directory/ and using the -exec (execute) option to call the Python script copyit.py, which will move all .mkv files to LTO tape. I’ve written more about using the checksum verification script copyit.py in my DIY Python Microservices blog here. It’s an amazing little script!

There’s also the ‘while’ loop, which has lots of features and can give you many more fun alternatives to play with when scripting. While loops are control flow statements that allow looping commands to execute based on a specified condition, or a command to run at regular intervals such as the short example below. It’s also possible to nest while loops so you can create loops within loops – something we will try later on.

This short loop generates a text file using the touch programme every two minutes. To ensure you don’t keep making a file with the same name, the $(date +%s) command substitution changes the file name to a string of numbers calculated in Unix Epoch Time: seconds counted since 1970-01-01 00:00:00 UTC. The sleep command ensures the .txt creation occurs every 120 seconds, or two minutes.

while true; do
  touch text-$(date +%s).txt
  sleep 120
done
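While loops don’t have to run forever: replacing true with a test condition makes the loop stop by itself once the condition fails. A small sketch counting down from three:

```shell
# The loop body repeats while the condition is true, and exits
# as soon as count is no longer greater than zero.
count=3
while [ "$count" -gt 0 ]; do
    echo "countdown: $count"
    count=$((count - 1))
done
# prints countdown: 3, then countdown: 2, then countdown: 1
```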

Pipelines

I learnt about the pipeline within my first week at the BFI, and my goodness what an amazing revelation this little line was. A pipeline is created when you link two or more commands in a line, separated by the pipeline control operators | or |&. In the following example the output of the first command is passed to the second command, and the exit status of the pipeline is that of the last command in it. So from the concatenate, or cat kern.log command, only lines containing ‘killed’ will be returned. I’m looking for instances of processes being prematurely killed because the system has run out of memory.

cat /var/log/kern.log | grep 'killed'

You can make fairly sizeable and elaborate pipelines, filtering information from a log until you have exactly the key line of text you need, like in this example:

cat global.log | grep "2020-03-25" | grep -v "video/" | grep ".mkv" | cut -c 95- | sort -u

This output uses cat (concatenate) to view the contents of the log file, which contains date, path information and action triggers carried out by another Python script. The pipeline’s first grep searches for a specific date, 25th March 2020. All items found with this date then pass to the next grep, which uses -v to reverse the selection: it returns everything except entries containing the “video/” directory. Next the grep pulls out only lines containing files with the extension “.mkv”. Finally the pipeline trims the beginning of the output using the cut command. This cut removes unique date and time information, allowing the last command in the pipeline, sort, to arrange the items into ascending order; -u removes any identical lines, returning one instance only. This is a command I’ve been using a lot to check RAWcooked mkv files have been ingested into the BFI’s Digital Preservation Infrastructure successfully. With a few changes in the pipeline I can filter thousands of lines of log information in seconds, delivering exactly the material I need to view. Pipelines rock!

The Shebang

Each script has a first entry at the top called a shebang, which indicates which shell you want to interpret the script. This example indicates a preference for bash:

#!/bin/bash

If you don’t specify this, the script will execute using the system’s default shell interpreter, usually /bin/sh, which uses Bourne shell, the predecessor to bash. Unfortunately running a bash script in /bin/sh may result in errors, as Bourne shell isn’t forward compatible. To specify bash, and avoid this issue, always start a script with a bash shebang like the one above. You may also see examples written like this:

#!/usr/bin/env bash

This runs a script from whichever bash executable appears first in a user’s $PATH variable, making the script transferable between platforms or operating systems. To check the full list of available shells on your system you can type the following command, which will also helpfully give you the paths for your shebang entry:

cat /etc/shells

The shebang is the only line beginning with a # character that influences how a shell script runs; every other line starting with # is a comment and is ignored by the interpreter. Using comments is really important for helping unpick your more complex coding choices, or instructing others what a command might mean.
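A tiny sketch showing both uses of the # character side by side:

```shell
#!/bin/bash
# This line is a comment and is ignored by the interpreter;
# only the shebang on line one gets special treatment.
echo "visible output"  # inline comments after a command work too
```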

Further reading

This is such a brief introduction that I definitely recommend taking a look at some other resources such as Ashley Blewer Training pages, and ShellScript.sh which provides a very thorough understanding of variables, functions, and everything you’ll ever need to know. Do read up on the difference between glob and regex, and shell parameter expansions as they regularly trip me up!  The manual for bash is a wonderful resource too, and can be accessed online here, or by typing man bash into your Terminal window. Don’t forget that programmes you already use in command line can be called and executed in shell scripts – so FFmpeg, RAWcooked, Python or any of the fleet of tools available to Linux or Unix Systems like mv, cp, rsync, cat, echo, and the mighty grep.  To aid my learning I bought two books which have been incredibly useful. The first is Your Linux Toolbox by Julia Evans, a collection of smaller zines full of easily digestible Linux tricks. The next is Small, Sharp Software Tools by Brian P Hogan which I’m half way through and it’s already been amazingly useful! I’m a fairly regular visitor to stackoverflow.com where nearly every question I have to ask has been answered before!

On page two we’ll look at tools for writing scripts and break down a couple of short examples.
