Using bash scripts to automate #AVpres workflows

2. Better: Writing scripts

So hopefully you have some understanding of the value of using bash scripts to combine multiple software tools and deliver an automated output tailored to your needs. You can write your shell scripts in a Terminal editor like Nano, in script writing software, or even in a normal text editor. As I’m learning Python too, I use Visual Studio Code for more complicated scripts and Nano for short test scripts. Visual Studio Code is free GUI software with some excellent features, like debugging and an integrated Terminal which lets you choose which shell interpreter you want to run a script in. It’s a great way to see what kind of differences exist between shells. If you want to try Nano and it isn’t already installed on your Ubuntu or Debian system, you can install it from Terminal using a command like this:

sudo apt-get install nano

MacOS comes with Nano as standard, though you might want to upgrade it to the latest version at some point. To launch it just type nano into the Terminal, then you can have a go at writing your first shell script.

Your first script

If this is your first ever go at script writing you can try making a simple script like the example below, which uses the bash built-in command ‘echo’ to print three lines of text. Let’s assume you’re writing it in Nano, but if you want to use a text editor or scripting software then the experience won’t be much different.

#!/bin/bash

# Print out three lines of text

echo "Hello #digipres friends!"
echo -n "This will print three lines of text"
echo -e "\nGO \t ON \t TRY!\n"

[Image: A nano window showing the test script being edited]

The first entry you will type is your shebang, indicating that you want to use bash to interpret the script. The second line is a comment, signified by the # at the beginning, and can be whatever you want it to be. It won’t be read or executed, it’s just there for the script writer. I’ve heard Stephen paraphrase Jason Scott’s comment on metadata, describing comments as love letters to your future self, helping you remember why you structured the script a particular way. The executable lines of this script are the three lines that start with echo. Echo is used to print, or display, strings of text. I use it mostly to output status updates to log files, as you’ll see later.
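
The test script also uses two echo options that are worth a quick look. As a small sketch (the variable names here are just for illustration), -n stops echo adding a newline at the end, and -e switches on backslash escapes such as \t for a tab and \n for a new line:

```bash
#!/bin/bash

# -n suppresses the trailing newline, so the next echo continues on the same line
first=$(echo -n "no newline"; echo " <- same line")

# -e interprets backslash escapes, here \t becomes a tab character
tabbed=$(echo -e "GO\tON")

echo "$first"
echo "$tabbed"
```

This is why the second and third echo lines in the test script appear to run together until the \n at the start of the third string forces a line break.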

Once you’ve typed the script you can either save it with ctrl + s (or ctrl + o if your version of Nano doesn’t support that), or exit and save using ctrl + x, giving the file the extension ‘.sh’, such as test.sh. The file will save into the location your Terminal was in before you launched Nano. You can check this location with the print working directory command, by typing pwd into Terminal. Before you can test run the script you need to change the execute permissions of the shell script, which requires the use of the chmod command:

chmod +x test.sh

If the file is owned by another user, or you otherwise don’t have permission to change it, you may need to add sudo to the start of the command and provide an admin password when prompted. Now you can run your script by typing this:

./test.sh

You need to add ./ to the start of the script name because the shell only looks for commands in the directories listed in your PATH environment variable, and the current directory isn’t normally one of them. This is an excellent protective step to stop malicious scripts making their way onto your system and being called from any location by accident. By using ./ you’re specifying it’s the script in the same directory your Terminal is located in. You can add a script directory to your PATH environment variable so you can call scripts you use regularly from any working Terminal directory.
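
Here is a minimal sketch of the PATH idea. It assumes a hypothetical scripts folder (a temporary directory stands in for something like a scripts folder in your home directory, so nothing on your system is touched) and a made-up script called hello.sh:

```bash
#!/bin/bash

# Hypothetical scripts folder; a temporary directory is used for the demonstration
script_dir=$(mktemp -d)

# Create a small executable script inside it
printf '#!/bin/bash\necho "Hello from my scripts folder"\n' > "${script_dir}/hello.sh"
chmod +x "${script_dir}/hello.sh"

# Add the folder to PATH for this shell session only
export PATH="${script_dir}:${PATH}"

# The script can now be called by name alone, without ./ and from any directory
output=$(cd / && hello.sh)
echo "$output"
```

The change only lasts for the current session. To keep it, the export line would usually go into a shell startup file such as ~/.bashrc, pointing at a real folder rather than a temporary one.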

You’ll see from the terminal output after running the script that the second line, “# Print out three lines of text”, doesn’t appear because it’s a comment.  However, if a # is enclosed within a string using quotes then it will print along with the rest of the characters, such as in the first printed line “Hello #digipres friends!”.

So that’s a very basic first script to try out. Now let’s look at a few other scripts, and try to understand how to read and interpret them.  Adapting other people’s scripts is a great way to learn, particularly using websites like Stackoverflow.com to troubleshoot and make changes.

My MediaConch script

The film collection at the BFI is vast and diverse, and in my first three months at the BFI I’ve come across some interesting issues when RAWcooking some of these DPX sequences. Some have highlighted software bugs that we’ve helped fix, while other problems have been caused by the quantity and size of the DPX scans themselves. As a result the FFV1 Matroska files have suffered from variations in properties such as bit rates and slice counts. This has prompted me to define a MediaConch policy which sets out our acceptable MKV parameters. These include a minimum acceptable bit rate, a check that the file is lossless and progressive, that the slice count is 16 or above, and that both the Matroska container and the FFV1 video have error detection. The policy also checks for attachments to ensure the RAWcooked reversibility data is embedded within the Matroska. The goal was to incorporate this policy into the BFI’s post-rawcook.sh script, which cleans up folders full of completed and partially cooked files. The script below is the shortened script I’ve written for this purpose. It can also be used independently to search through any folder of MKVs and compare them against a policy, while outputting results to a log file of your choice. You can see it incorporated into the full Post-rawcook.sh script on page 4.

#!/bin/bash -x

# Script to search for MKV files modified more than 20 mins ago.
# Check the MKV against a MediaConch policy with search to return "fail" files.
# Loop that separates failed files from pass files and moves fails to 'killed' folder.
# echo outputs written to post_rawcooked.log file.
# Temporary .txt file created just to store names of failed files, overwritten rather than deleted.

mkv_destination="/mnt/qnap/Public/rawcooked/"

find ${mkv_destination}mkv_cooked/ -name "*.mkv" -mmin +20 | while IFS= read -r files; do
check=$(mediaconch --force -p /mnt/isilon/rawcooked/mkv_policy.xml "$files" | grep "fail")
filename=$(basename "$files") 
  if [ -z "$check" ];
    then
      echo "*** RAWcooked MKV file $filename has passed the Mediaconch policy. Whoopee ***" >> ${mkv_destination}post_rawcooked.log
    else
      {
        echo "*** FAILED RAWcooked MKV $filename has failed the mediaconch policy. Grrrrr ***"
        echo "*** Moving $filename to killed directory, and amending log fail_${filename}.txt ***"
        echo "$check"
      } >> ${mkv_destination}post_rawcooked.log
        echo "$filename" > ${mkv_destination}temp_mediaconch_policy_fails.txt
  fi
done

grep ^N ${mkv_destination}temp_mediaconch_policy_fails.txt | parallel --progress --jobs 10 "mv ${mkv_destination}mkv_cooked/{} ${mkv_destination}killed/{}"
grep ^N ${mkv_destination}temp_mediaconch_policy_fails.txt | parallel --progress --jobs 10 "mv ${mkv_destination}mkv_cooked/{}.txt ${mkv_destination}logs/fail_{}.txt"

I borrowed sections from other BFI scripts written by Stephen McConnachie, wrote some lines myself and added useful bits I found on stackoverflow.com.  The script draws on the external programme MediaConch alongside local Linux programmes such as mv, echo, grep and basename. I’ve added comments to help explain how the script functions, and I’ve also broken it down below to explain its structure a little more. Feel free to give it a go! I’ve added the MediaConch policy at the bottom of this section if you don’t have one of your own to use.

#!/bin/bash -x

The script starts with a shebang, but alongside it you will see the -x. This is short for xtrace, or execution trace, which makes bash display all the commands and their arguments in Terminal while they are being executed. Shell tracing mode is one of a few tools that belong to the shell’s debug mode and it’s a really wonderful gift! You can also use -n, short for noexec, if you want to dry run a script using syntax checking mode, or -v for verbose mode, showing all lines in a script as they are read. You can use one, or a combination, of these debug tools when you first test a script. Read more about debugging on the Bash Beginners Guide.
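
These debug options can also be passed to bash directly, or switched on inside a script. The sketch below is only an illustration: it writes a deliberately broken throwaway script to a temporary file, checks it with -n, then turns xtrace on and off around a single line.

```bash
#!/bin/bash

# Write a deliberately broken script (the 'if' is never closed with 'fi')
broken=$(mktemp)
printf 'if [ -z "$1" ]; then\n  echo "no argument"\n' > "$broken"

# -n parses the script without executing it; a syntax error gives a non-zero exit status
if bash -n "$broken" 2>/dev/null; then
  syntax_result="syntax ok"
else
  syntax_result="syntax error found"
fi
echo "$syntax_result"

# xtrace can be switched on and off inside a script, around just the part being tested
set -x
name="test.mkv"
set +x

rm -f "$broken"
```

Using set -x and set +x around a small section keeps the trace output short when only one part of a long script is misbehaving.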

mkv_destination="/mnt/qnap/Public/rawcooked/"

This line creates a variable called mkv_destination, and places a string path to a folder within it. This is useful if you use a path multiple times in a script, saving you from laboriously retyping it each time. Also, if you need to change a path destination to another storage device, for example, you only have to do it once, and all uses of the variable will update accordingly. Watch out how you leave your path trailing slashes: they can be tricky and can break scripts really easily.
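
To show why the trailing slash matters, here is a small sketch using made-up paths (they don’t need to exist, as the script only builds strings). It also shows one possible way to make the join predictable, which is an assumption on my part rather than something the BFI scripts do:

```bash
#!/bin/bash

with_slash="/mnt/example/rawcooked/"     # hypothetical path, ends in a slash
without_slash="/mnt/example/rawcooked"   # same path, no trailing slash

# Joining directly only works when the variable ends in a slash
good_path="${with_slash}mkv_cooked/"
bad_path="${without_slash}mkv_cooked/"

echo "$good_path"   # /mnt/example/rawcooked/mkv_cooked/
echo "$bad_path"    # /mnt/example/rawcookedmkv_cooked/

# Stripping any trailing slash and adding one back makes the join predictable
safe_path="${without_slash%/}/mkv_cooked/"
echo "$safe_path"   # /mnt/example/rawcooked/mkv_cooked/
```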

find ${mkv_destination}mkv_cooked/ -name "*.mkv" -mmin +20 | while IFS= read -r files; do

This line demonstrates the versatility of the mighty little pipeline |.  Two things are going on in this line. First, find searches recursively through a folder called mkv_cooked for “*.mkv” files that haven’t been modified in the last 20 minutes. The second is the beginning of the while loop, which takes the results from the first half one line at a time and places each one in a variable called files. It uses ‘IFS= read -r’ to read each path exactly as find printed it: setting IFS to nothing stops read trimming leading and trailing whitespace, and -r stops it treating backslashes as escape characters. Something like ‘while read -r files; do’ would work just as well for tidy filenames like ours, but the IFS= form is the safer habit, and it’s just as appropriate when reading from text files line by line. The pipe symbol only moves the results of find, so files whose names don’t match “*.mkv” are ignored.  The ‘do’ keyword signifies the beginning of the loop, with ‘done’ being used to close the loop in a later section.
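
If you’re curious what ‘IFS=’ and ‘-r’ actually change, this small sketch compares a plain read with ‘IFS= read -r’. The sample line is invented, standing in for an awkward filename with leading spaces and a backslash:

```bash
#!/bin/bash

# A sample line with leading spaces and a backslash, standing in for an awkward filename
sample='  my\file.mkv'

# Plain read: leading whitespace is trimmed and the backslash is treated as an escape
plain=$(printf '%s\n' "$sample" | { read files; printf '%s' "$files"; })

# IFS= read -r: the line is kept exactly as it was written
exact=$(printf '%s\n' "$sample" | { IFS= read -r files; printf '%s' "$files"; })

printf 'plain read:   [%s]\n' "$plain"
printf 'IFS= read -r: [%s]\n' "$exact"
```

With well-behaved names like N_123456_01of01.mkv both forms give the same result, so the difference only shows up when a name contains unusual characters.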

check=$(mediaconch --force -p /mnt/isilon/rawcooked/mkv_policy.xml "$files" | grep "fail")

Here we use our first external programme, MediaConch, to check the file held in the new $files variable against a conformance policy, and then pipe the pass or fail output to a grep which keeps only the lines containing “fail”. The $check variable becomes either an empty string, when nothing in the output matched, or a string containing the complete “fail” output statement. As MediaConch is an external programme you will need to install it before you can use it.

filename=$(basename "$files")

This line uses the Linux programme basename to trim down the path information leaving just the filename. This is used when you want to just return the filename without the lengthy path information, for example outputting data to log files making them easier to read.
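
A quick sketch of basename in action, using a made-up path. It also shows two related tricks that aren’t in the script above: giving basename a suffix to strip, and the bash parameter expansion that does the same job without calling an external programme.

```bash
#!/bin/bash

files="/mnt/example/mkv_cooked/N_123456_01of01.mkv"   # hypothetical path

filename=$(basename "$files")        # N_123456_01of01.mkv
no_ext=$(basename "$files" .mkv)     # N_123456_01of01
builtin_name="${files##*/}"          # same result as basename, using parameter expansion

echo "$filename"
echo "$no_ext"
echo "$builtin_name"
```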

  if [ -z "$check" ];
    then
      echo "*** RAWcooked MKV file $filename has passed the Mediaconch policy. Whoopee ***" >> ${mkv_destination}post_rawcooked.log
    else
      {
        echo "*** FAILED RAWcooked MKV $filename has failed the mediaconch policy. Grrrrr ***"
        echo "*** Moving $filename to killed directory, and amending log fail_${filename}.txt ***"
        echo "$check"
      } >> ${mkv_destination}post_rawcooked.log
        echo "$filename" > ${mkv_destination}temp_mediaconch_policy_fails.txt
  fi
done

This block contains an if/else statement that takes the contents of $check and returns different outputs depending on whether it holds a fail response or nothing at all (meaning a pass). [ -z "$check" ] returns true when the string in $check has zero length, i.e. the policy outcome does not contain “fail”. -z is a conditional expression that tests a string. There are quite a few of these nifty little shortcuts, well worth taking a look at in man bash. Files that pass go straight to the then statement, where echo prints their $filename and a comment string to the post_rawcooked.log file. If $check holds a “fail” string the script skips to the else statement, which runs a few commands using $filename. These are enclosed within { } to group them, so that one collective >> ${mkv_destination}post_rawcooked.log appends all of their output to the log.
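
The same pattern can be tried out safely without MediaConch. In this sketch the function name report and the sample filenames are invented for illustration, and a temporary file stands in for the log:

```bash
#!/bin/bash

log=$(mktemp)    # temporary log file for the demonstration

report() {
  # $1 is a filename, $2 is the text returned by a check (empty means pass)
  if [ -z "$2" ]; then
    echo "PASS $1" >> "$log"
  else
    {
      # Both echo outputs share the single redirect after the closing brace
      echo "FAIL $1"
      echo "$2"
    } >> "$log"
  fi
}

report "a.mkv" ""
report "b.mkv" "fail: slice count"
cat "$log"
```

Grouping with { } saves repeating the redirect on every line, and keeps related log entries together.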

The final echo sees the $filename for the failed MediaConch policy printed to a temporary .txt document for reference in the next section of code. This time the script uses just >, which overwrites a file, unlike >>, which appends to the bottom of a document. Because I want this list to refresh with each script use it’s easier to use >, and avoid having to run a delete command at the end. Be aware, though, that because this > sits inside the loop the file is overwritten every time a file fails, so only the most recent failed filename is kept; if several files might fail in one run, empty the file once before the loop and use >> inside it. Sometimes you might also see &>> used, which means stdout and stderr outputs are both appended to the end of the log. The &>> can only be used from bash version 4 onward.
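
One thing to watch with > is where it sits: inside a loop it overwrites the file on every pass. This sketch, using invented filenames and a temporary file, compares that with emptying the file once and appending with >>:

```bash
#!/bin/bash

fails=$(mktemp)   # temporary list of failed files

# Using > inside a loop: each pass overwrites the file, so only the last name is left
for name in N_1.mkv N_2.mkv N_3.mkv; do
  echo "$name" > "$fails"
done
overwrite_count=$(grep -c '' "$fails")   # counts the lines in the file

# Emptying the file once, then appending with >>, keeps every name
: > "$fails"
for name in N_1.mkv N_2.mkv N_3.mkv; do
  echo "$name" >> "$fails"
done
append_count=$(grep -c '' "$fails")

echo "overwrite: $overwrite_count line(s), append: $append_count line(s)"
rm -f "$fails"
```

The `: > file` line is a common way to empty a file without deleting it, so the list still refreshes on each run.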

grep ^N ${mkv_destination}temp_mediaconch_policy_fails.txt | parallel --progress --jobs 10 "mv ${mkv_destination}mkv_cooked/{} ${mkv_destination}killed/{}"
grep ^N ${mkv_destination}temp_mediaconch_policy_fails.txt | parallel --progress --jobs 10 "mv ${mkv_destination}mkv_cooked/{}.txt ${mkv_destination}logs/fail_{}.txt"

This section uses grep to read the temp_mediaconch_policy_fails.txt file for all lines beginning with N (that’s what ^N means) and passes them to GNU Parallel to move the MKVs to a ‘killed’ folder. More on GNU Parallel later. The second grep uses the same method to move each MKV’s log file to the logs folder, prefixing its name with “fail_”.

You might notice that sometimes we use .log and sometimes .txt. Stephen passed this habit on to me, suggesting that a .txt file is seen as a temporary file, whereas a .log extension suggests the file stores information accrued over time and is therefore of value and not to be deleted. Finally, here’s my MediaConch policy in case you want to give it a go. Just save it as an .xml and try it against your RAWcooked MKV files.

<?xml version="1.0"?>
<policy type="or" name="RAWcooked MKV error checks" license="CC-BY-SA-4.0+">
  <description>Test that the video file is suitable for archiving.
- Container format is Matroska with error detection (CRC)
- Video format is FFV1 with error detection (CRC) and with Intra mode
</description>
  <policy type="and" name="BFI RAWcooked MKV checks">
    <rule name="Container is MKV" value="Format" tracktype="General" occurrence="*" operator="=">Matroska</rule>
    <rule name="MKV version 4 or greater" value="Format_Version" tracktype="General" occurrence="*" operator=">=">4</rule>
    <rule name="Unique ID is present" value="UniqueID" tracktype="General" occurrence="*"/>
    <rule name="Duration field exists" value="Duration" tracktype="General" occurrence="*"/>
    <rule name="Container uses error detection" value="extra/ErrorDetectionType" tracktype="General" occurrence="*" operator="=">Per level 1</rule>
    <rule name="Attachments present" value="extra/Attachments" tracktype="General" occurrence="*" operator="=">RAWcooked reversibility data</rule>
    <rule name="Overall bit rate more than" value="OverallBitRate" tracktype="General" occurrence="*" operator=">=">300</rule>
    <rule name="Video is FFV1" value="Format" tracktype="Video" occurrence="*" operator="=">FFV1</rule>
    <rule name="FFv1 version 3.4 or later" value="Format_Version" tracktype="Video" occurrence="*" operator=">=">3.4</rule>
    <rule name="GOP size of 1" value="Format_Settings_GOP" tracktype="Video" occurrence="*" operator="=">N=1</rule>
    <rule name="FFV1 is lossless" value="Compression_Mode" tracktype="Video" occurrence="*" operator="=">Lossless</rule>
    <rule name="FFV1 is progressive" value="ScanType" tracktype="Video" occurrence="*" operator="=">Progressive</rule>
    <rule name="Frame Rate is Constant?" value="FrameRate_Mode" tracktype="Video" occurrence="*" operator="=">CFR</rule>
    <rule name="Video uses error detection" value="extra/ErrorDetectionType" tracktype="Video" occurrence="*" operator="=">Per slice</rule>
    <rule name="Video minimum slice count" value="extra/MaxSlicesCount" tracktype="Video" occurrence="*" operator=">=">16</rule>
    <rule name="Colour space is RGB" value="ColorSpace" tracktype="Video" occurrence="*" operator="=">RGB</rule>
    <rule name="Duration field exists" value="Duration" tracktype="Video" occurrence="*"/>
  </policy>
</policy>

Writing your own

This week I wanted to write a new script so I thought I’d document my process, in case it helps you with writing your first scripts. This new script checks a list of folders that should’ve been successfully RAWcooked and converted to MKV, and compares them to a log file that contains ingest progress of MKV files into the BFI’s Digital Preservation Infrastructure (DPI). It’s good to build scripts around examples you already have, and in this case I decided I could make it a reasonably close copy of the MediaConch script above. It uses a while loop fed from a grep of a text file. The if/else statement is fairly similar to the MediaConch script, though there are fewer echo outputs. My notes to the right helped me shape the script, and although I didn’t write them up correctly the stages following helped turn it into a functioning script.

I was worried that the grep in line 18 might return unexpected strings, but as the ‘deleted’ search comes after searching for the N name of the file I’m fairly sure it should only ever return nothing or the full line containing the N number and the ‘deleted’ message, as long as the text file supplying the N numbers is functional, and no changes are made to the log output from the Python scripts. I’m using the -z conditional expression again, so this empty output or string output is really useful. Remember, -z is true if a zero-length string is returned. So simple and elegant! So here’s the finished script, followed by some of the methods I used to get to this point.

#!/bin/bash -x

# Variables for script, so it's easy to relocate
log_path="/mnt/isilon_lt2/"
timestamp=$(date "+%Y-%m-%d  -  %H.%M.%S")
dpx_list="/mnt/isilon/dpx_cooked/"
global_log="/mnt/qnap/lto_project/global_copy.log"

# Create list ordered by the extensions _01of*
ls "$dpx_list" | sort -n -k1.10 > ${log_path}temp_dpx_list.txt

# Begin with writing start time to log
echo "====== $timestamp ====== Comparing dpx_cooked folder to global.log ======" >> ${log_path}global_log_check.log

# Search within txt file for N_ numbers, passed to $files variable in while loop
grep ^N ${log_path}temp_dpx_list.txt | while IFS= read -r files; do
 # Comparison of $files list against the global log, stored in variable $on_global
 on_global=$(grep "$files" "$global_log" | grep 'deleted')
 # if $on_global returns no string (ie, has no 'deleted' entry / any entry in global log)
 if [ -z "$on_global" ]
   then
     echo "****** ${files}.mkv has not passed into autoingest ******" >> ${log_path}global_log_check.log
   # else, if it has an entry for both titled, and 'deleted'
   else
     echo "====== ${files}.mkv has been RAWcooked and is in Imagen ======" >> ${log_path}global_log_check.log
  fi
done

While typing up this script in Visual Studio Code I ran tests on short single line sections of the script in Terminal, altering the commands as I moved through the sections. I recommend creating a safe environment to test new scripts that won’t damage any actual files. Once the script seemed finished I saved it with the .sh extension and then made it executable with chmod +x, before running it in Terminal. This quickly threw up some failures, so I activated the inbuilt bash debugger xtrace, using -x behind the shebang, which revealed where the errors lay. Terminal xtrace output kept saying that the files grepped for “had not passed into autoingest” when I knew they were actually there. This drew my attention to line 16 (18 in the script above), and I found that my first ill-thought-out double grep, which searched for both ‘persistence’ and ‘deleted’, was failing, as you never find instances of both search terms in one line. Silly me! As the ‘deleted’ term better reflected the completion of the autoingest process I opted to keep this, and remove the ‘persistence’ grep.

Once it seems to be working properly the final stage is to copy and paste it to the AMAZING web page shellcheck.net. Here the script is assessed line by line and warnings tell you where your syntax is incorrect or could be improved. Each hazard symbol provides links to explanations of the error type and how you can amend it. This tool is an amazing friend when you’re learning bash scripting. For example, today I relearnt the difference between glob and regex, because I was wrongly using glob syntax in my grep searches, when grep natively uses regex. This was easily changed by turning my grep “N*” (glob) to grep ^N (regex). It’s tools like this that make bash scripting so much quicker to pick up, so make sure you bookmark shellcheck.net.
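
To see why the glob-style pattern goes wrong in grep, here is a small sketch using an invented list of names. As a regex, N* means “zero or more N characters”, which matches every line, including lines with no N at all:

```bash
#!/bin/bash

list=$(mktemp)
printf '%s\n' "N_123456_01of01" "C_987654_01of01" "notes" > "$list"

# As a regex, N* matches zero or more N characters, so every line counts as a match
star_matches=$(grep -c "N*" "$list")

# ^N anchors the match to the start of the line: only names beginning with N
anchored_matches=$(grep -c "^N" "$list")

echo "N* matched $star_matches lines, ^N matched $anchored_matches line"
rm -f "$list"
```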

This script saved me a massive amount of time while comparing 600+ Matroska files cooked a few months ago against the global.log. However, when I ran the script today against a newly cooked Isilon which finished in the last few days, there were many instances where the MKV file had not yet reached the ‘deleted’ stage. So I thought it would be helpful to run a second search against the files that returned nothing, and to return a more appropriate statement than “HAS NOT PASSED INTO AUTOINGEST”. To do this I decided to nest a second if/else statement inside the first, which checks for a second grep, ‘Skip object’. You see ‘Skip object’ in global.log when an item has made it into the autoingest queue, but hasn’t had a chance to complete yet. So the script from line 16 now looks like this:

grep ^N ${log_path}temp_dpx_list.txt | while IFS= read -r files; do
    on_global=$(grep "$files" "$global_log" | grep 'deleted')
    if [ -z "$on_global" ]
        then
            skipped=$(grep "$files" "$global_log" | grep 'Skip object')
            if [ -z "$skipped" ]
                then
                  echo "****** ${files}.mkv HAS NOT PASSED INTO AUTOINGEST! ******" >> ${log_path}global_log_check.log
                else
                  echo "===== ${files}.mkv has been RAWcooked and is being ingested" >> ${log_path}global_log_check.log
            fi
        else
            echo "====== ${files}.mkv has been RAWcooked and is in Imagen ======" >> ${log_path}global_log_check.log
    fi
done

You’ll notice a new variable called $skipped. Its grep is the first thing to run against the files without a ‘deleted’ response. The script then runs another check, if [ -z "$skipped" ], and returns a new response from the else statement where ‘Skip object’ is present (see examples below). And it works straight away with no errors, so I’m very happy! This is a really good illustration of how nested if statements can improve a search request, returning more precise information. I love this script already and it’s going to become an important part of my RAWcooked workflow here at the BFI. Also, I can see lots of useful script off-shoots using this methodology to help search and compare data.
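
If you’d like to experiment with this kind of layered check away from real logs, here is a self-contained sketch. The log lines, the N_ numbers and the function name check_status are all made up for illustration, and it uses if/elif/else with grep -q as an alternative to the nested -z tests above, rather than being the script I actually run:

```bash
#!/bin/bash

# Made-up log lines for illustration; the real global.log format will differ
global_log=$(mktemp)
printf '%s\n' \
  "N_111 persistence check passed" \
  "N_111 source file deleted" \
  "N_222 Skip object, still in queue" > "$global_log"

check_status() {
  # $1 is an N_ number to look up in the log
  if grep "$1" "$global_log" | grep -q 'deleted'; then
    echo "ingested"
  elif grep "$1" "$global_log" | grep -q 'Skip object'; then
    echo "being ingested"
  else
    echo "not in autoingest"
  fi
}

check_status "N_111"
check_status "N_222"
check_status "N_333"
```

grep -q prints nothing and simply reports through its exit status whether a match was found, so the if statement can test the pipeline directly instead of storing the output in a variable first.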

On page 3 we’ll look at ways to further automate your automation scripts.
