Using bash scripts to automate #AVpres workflows

4. Stronger: Sharing scripts

In the spirit of the open-source software and community that this and my other workflows are inspired by, I feel it’s really important to share our two RAWcooked scripts in full below. I hope they are of use to others, and I’d welcome improvements and feedback about how they work in different environments. If you do decide to share some scripts in the future, don’t forget to comment generously! Comments are immensely helpful to anyone learning scripting for the first time, and the more verbose and open we all are with our workflows and practices, the more other archives internationally can benefit from shared craftsmanship.

[Update Oct 2021: Please check out our full DPX preservation repository for the latest versions of the following scripts. You can access them at the BFI DPX Encoding repository on GitHub]

The BFI RAWcooked scripts

There are two key scripts we use around the clock at the BFI, and many thanks to Stephen McConnachie who composed most of these scripts, and the BFI for allowing them to be shared. They are works in progress and are regularly tweaked and changed as new RAWcooked snapshots are adapted to our workflow, as we add new elements to the workflow, or as my own understanding of bash improves.

Both these scripts are managed by Linux crontab and Linux’s own lock programme, flock, to prevent multiple copies of a script running concurrently. As you know, rawcook.sh runs every fifteen minutes via /etc/crontab, but because parallel is actively processing 20 cooks at a time, flock prevents the script from restarting while a run is underway. When a script cycle of 20 ends you know it won’t take long before it restarts, hence the 15-minute prompts. Post_rawcook.sh is set to run on crontab every 8 hours, allowing rawcook.sh to build up some MKV files ready for assessment. This makes clocking into work at 9am and 4pm two of my favourite times of the day, when I can check in to see how many more successful cooks have completed processing and are ready to be ingested into the BFI DPI.
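
To give a flavour of the scheduling, here is a minimal sketch of the kind of /etc/crontab entries involved. The username, script paths and lock file locations below are illustrative assumptions, not our exact configuration:

# flock -n exits immediately if the lock is already held, so a long-running cook is never restarted mid-run
*/15 *  * * *   archivist   /usr/bin/flock -n /home/archivist/rawcook.lock /home/archivist/rawcook.sh
0 */8   * * *   archivist   /usr/bin/flock -n /home/archivist/post_rawcook.lock /home/archivist/post_rawcook.sh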

I won’t provide full explanations of these scripts because this blog is long enough already! I’ve left in our comments where appropriate and I hope these will be enough to illustrate what’s happening. Remember, I’m no expert and there will be different and more informed interpretations, so do double-check online or with professional scripters. There will be better ways to execute these scripts, and if you have any suggestions for code improvements please drop me a line.

Rawcook.sh

The first bash script is rawcook.sh. This script works its way through a directory containing hundreds of DPX sequences, randomly selects 20 of them, then passes them to GNU Parallel, which controls the RAWcooking process, running four jobs simultaneously. The RAWcooked command may be familiar to you, and it sits behind the parallel --jobs 4 command near the end of the script. Various stages of the process are output to a log file so any errors or problems can be identified at the appropriate stage. You’ll see lots of echo >> outputs to logs! I hope to add a log function that handles this more neatly in future.
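
On that note, here is a minimal sketch of the kind of log function I have in mind. The function name and usage are hypothetical, nothing like it exists in the script below yet:

# Hypothetical helper: timestamp each message and append it to the log in one place
log() {
  timestamp=$(date +'%Y-%m-%d - %T')
  echo "$timestamp $1" >> "${mkv_destination_qnap}rawcook.log"
}

# Usage, replacing the longer echo ... >> lines:
# log "DPX folder will be cooked: $folders"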

#!/bin/bash

# Create full timestamp variable for use in logging
date_FULL=$(date +'%Y-%m-%d  - %T')

# ===============================================================
# === Source and destination folder paths as variables ==========
# ===============================================================

# Variables for Isilon and BFI QNAP rawcooked folder
dpx_source="/mnt/isilon/"
mkv_destination_qnap="/mnt/qnap/Public/rawcooked/"

# Remove the temporary list so each run cooks from a fresh list (-f so the first ever run doesn't error)
rm -f ${mkv_destination_qnap}temporary_rawcook_list.txt
# Use temp_queued_list.txt to check each dpx folder before cooking, to make sure it is not already cooked or currently being cooked
ls ${mkv_destination_qnap}mkv_cooked > ${mkv_destination_qnap}temp_queued_list.txt
# Create new temporary_rawcook_list.txt each time script runs
touch ${mkv_destination_qnap}temporary_rawcook_list.txt

# Write a START note to the logfile
echo "========== DPX rawcooking STARTED  ===================================================== $date_FULL" >> ${mkv_destination_qnap}rawcook.log

# Find all folders in the source path, pick 20 at random, and queue any not already cooked or cooking; the queued list is passed
# to GNU Parallel below to rawcook with --check (to verify reversibility), -y (to suppress interaction),
# --accept-gaps (to allow for breaks in sequence numbering), and -o for the MKV destination, with console output saved to a log file

find ${dpx_source}dpx_to_cook/ -maxdepth 1 -type d -name "N*" | shuf -n 20 | while IFS= read -r folders; do
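  # cut -c 29- strips the source path prefix from the find output, leaving just the folder name (adjust 29 to your own path length)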
  folder_clean=$(echo "$folders" | cut -c 29-)
  count_cooked=$(grep -c "$folder_clean" ${mkv_destination_qnap}rawcooked_success.log)
  count_queued=$(grep -c "$folder_clean" ${mkv_destination_qnap}temp_queued_list.txt)
  echo "$folder_clean: count cooked = $count_cooked, count_queued = $count_queued" >> ${mkv_destination_qnap}rawcook.log
  if [ "$count_cooked" -eq 0 ] && [ "$count_queued" -eq 0 ];
   then
    echo "$date_FULL DPX folder will be cooked: $folders" >> ${mkv_destination_qnap}rawcook.log
    echo "$folder_clean" >> ${mkv_destination_qnap}temporary_rawcook_list.txt
   else
    echo "$date_FULL Skipping DPX folder, it is already cooked or being cooked: $folders" >> ${mkv_destination_qnap}rawcook.log 
  fi
done

# Sends the temporary rawcooked list to parallel to cook multiple RAWcook jobs 
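# parallel substitutes each folder name for {} and runs up to four RAWcooked jobs at once, appending each job's console output to its own .mkv.txt log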
cat ${mkv_destination_qnap}temporary_rawcook_list.txt | parallel --jobs 4 "rawcooked --check -y --accept-gaps /mnt/isilon/dpx_to_cook/{} -o ${mkv_destination_qnap}mkv_cooked/{}.mkv &>> ${mkv_destination_qnap}mkv_cooked/{}.mkv.txt"

# Write an END note to the logfile
echo "========== DPX rawcooking ENDED  =================================================== $date_FULL" >> ${mkv_destination_qnap}rawcook.log

Post_rawcook.sh

This next script is post_rawcook.sh, and has more elements within it as it incorporates analysis of the cooked MKV files and assessment of the rawcook.sh logs, in addition to my MediaConch policy comparison. The rawcook.sh logs can include specific error messages unique to RAWcooked, such as “Reversability was checked, issues detected” and “Error: undecodable file (can not be open)”. The MKV files that pass all the checks are finally moved to an ‘autoingest’ folder where a separate Python script picks them up and starts the DPI ingest process.

#!/bin/bash

# Create full timestamp variable for use in logging below
date_FULL=$(date +'%Y-%m-%d  - %T')

# ===============================================================
# === Source and destination folder paths as variables ==========
# ===============================================================

# = Variables for Isilon and BFI QNAP rawcooked folder
dpx_source="/mnt/isilon/"
mkv_destination_qnap3="/mnt/qnap/Public/rawcooked/"
mkv_autoingest_qnap2="/mnt/qnap/Public/autoingest/"

#  Write a START note to the logfile
echo "========== Post-rawcook workflows STARTED  ================================================= $date_FULL" >> ${mkv_destination_qnap3}post_rawcooked.log

# ====================================================================
# FOR ====MEDIACONCH POLICY FAILURES==== Remove fails to Killed folder
# ====================================================================

# MediaConch policy checks for missing duration, bitrate >300Mb/s, slices >16 and many more.

# Start each run with an empty fails list; failures found in the loop below are appended to it
true > ${mkv_destination_qnap3}temp_mediaconch_policy_fails.txt

find ${mkv_destination_qnap3}mkv_cooked/ -name "*.mkv" -mmin +20 | while IFS= read -r files; do
  check=$(mediaconch --force -p /mnt/isilon/rawcooked/mkv_policy.xml "$files" | grep "fail")
  filename=$(basename "$files")
  if [ -z "$check" ];
    then
      echo "*** RAWcooked MKV file $filename has passed the Mediaconch policy. Whoopee ***" >> ${mkv_destination_qnap3}post_rawcooked.log
    else
      {
        echo "*** FAILED RAWcooked MKV $filename has failed the mediaconch policy. Grrrrr ***"
        echo "*** Moving $filename to killed directory, and amending log fail_${filename}.txt ***"
        echo "$check"
      } >> ${mkv_destination_qnap3}post_rawcooked.log
      echo "$filename" > ${mkv_destination_qnap3}temp_mediaconch_policy_fails.txt
  fi
done

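# Move each failed MKV into the killed folder, and rename its console log to fail_<name>.txt (grep ^N matches the N-prefixed sequence names)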
grep ^N ${mkv_destination_qnap3}temp_mediaconch_policy_fails.txt | parallel --progress --jobs 10 "mv ${mkv_destination_qnap3}mkv_cooked/{} ${mkv_destination_qnap3}killed/{}"
grep ^N ${mkv_destination_qnap3}temp_mediaconch_policy_fails.txt | parallel --progress --jobs 10 "mv ${mkv_destination_qnap3}mkv_cooked/{}.txt ${mkv_destination_qnap3}logs/fail_{}.txt"

# =============================================================
# FOR ====PASS==== - i.e. successful RAWcooked --check outcomes
# =============================================================

# Check for presence of PASS cases, process those, and add to log files
count=$(grep -c "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | grep -v -c ':0')
if [ "${count}" -gt 0 ];
 then
   # Check in the .mkv.txt files for the -no issue detected- rows and output them to the rawcooked_success.log
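   # (rev | cut -c 9- | rev strips the trailing '.mkv.txt', 8 characters, from each matching path)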
   grep -l "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 9- | rev >> ${mkv_destination_qnap3}rawcooked_success.log

   # Log the moves of MKV into autoingest
   echo "*** No reversibility issues detected with these Matroska files, moving them into autoingest ***" >> ${mkv_destination_qnap3}post_rawcooked.log
   grep -l "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 9- | rev >> ${mkv_destination_qnap3}post_rawcooked.log
 
   # Move the mkv into ingest workflow
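   # (after rev, cut -c 5- strips '.txt' leaving the .mkv path; cut -c 55- then strips the directory prefix, leaving just the filename; adjust 55 to your own path length)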
   grep -l "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 5- | rev | cut -c 55- | parallel --progress --jobs 10 "mv ${mkv_destination_qnap3}mkv_cooked/{} ${mkv_autoingest_qnap2}{}" 
 
   # Log the dpx folders that are being moved
   echo "*** No reversibility issues detected with these DPX sequences, moving the folders into /mnt/isilon/dpx_cooked/ ***" >> ${mkv_destination_qnap3}post_rawcooked.log
  
   grep -l "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 9- | rev >> ${mkv_destination_qnap3}post_rawcooked.log
 
   # Move the dpx folders from dpx_to_cook folder into dpx_cooked, otherwise they will be cooked again on the next pass.
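   # (the same rev/cut trick strips '.mkv.txt' and the directory prefix, recovering the bare DPX folder name)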
   grep -l "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 9- | rev | cut -c 55- | parallel --progress --jobs 10 "mv ${dpx_source}dpx_to_cook/{} ${dpx_source}dpx_cooked/{}"
 
   # Move the txt files to logs folder
   grep -l "Reversability was checked, no issue detected" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | cut -c 55- | parallel --progress --jobs 10 "mv ${mkv_destination_qnap3}mkv_cooked/{} ${mkv_destination_qnap3}logs/{}" 
 
 else
   # Output 'no passes to deal with' message to log
   echo "*** No successful Matroska files to move into autoingest this time ***" >> ${mkv_destination_qnap3}post_rawcooked.log
fi

# ===============================================================
# FOR ====FAIL==== - i.e. unsuccessful RAWcooked --check outcomes
# ===============================================================

# Check for presence of FAIL cases, process those, and add to log files
count=$(grep -c "Reversability was checked, issues detected\|Error: undecodable file (can not be open)\|Error: undecodable cannot open file for reading\|Conversion failed!\|Error: unsupported DPX Number of image elements\|Error: untested multiple slices counts" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | grep -v -c ':0')
if [ "${count}" -gt 0 ];
 then
   echo "*** Reversibility or encoding issues were detected with these Matroska files, deleting them to allow RAWcooked to try again ***" >> ${mkv_destination_qnap3}post_rawcooked.log	
   grep -l "Reversability was checked, issues detected\|Error: undecodable file (can not be open)\|Error: undecodable cannot open file for reading\|Conversion failed!\|Error: unsupported DPX Number of image elements\|Error: untested multiple slices counts" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 5- | rev >> ${mkv_destination_qnap3}post_rawcooked.log
	
   # Delete the Matroskas
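   # (rev | cut -c 5- | rev strips the '.txt' from each log path, leaving the path to the Matroska itself)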
   grep -l "Reversability was checked, issues detected\|Error: undecodable file (can not be open)\|Error: undecodable cannot open file for reading\|Conversion failed!\|Error: unsupported DPX Number of image elements\|Error: untested multiple slices counts" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | rev | cut -c 5- | rev | parallel --progress --jobs 10 "rm {}"
	
   # Move the txt files to logs folder and prepend -fail- to filename
   grep -l "Reversability was checked, issues detected\|Error: undecodable file (can not be open)\|Error: undecodable cannot open file for reading\|Conversion failed!\|Error: unsupported DPX Number of image elements\|Error: untested multiple slices counts" ${mkv_destination_qnap3}mkv_cooked/*.mkv.txt | cut -c 55- | parallel --progress --jobs 10 "mv ${mkv_destination_qnap3}mkv_cooked/{} ${mkv_destination_qnap3}logs/fail_{}"
   
 else
   # Output 'no fails to deal with' message to log
   echo "*** No failed Matroska files to delete this time ***" >> ${mkv_destination_qnap3}post_rawcooked.log
fi

# ===============================================================
# FOR ==== INCOMPLETE ==== - i.e. killed processes ==============
# ===============================================================

# This block manages the remaining INCOMPLETE cooks not removed by the MediaConch policy
# It uses find -mmin +1440 to find txt files not modified in the last 1440 minutes (24 hours), then deletes the text file and the associated reversibility and mkv files

# Check for presence of stale .mkv.txt files remaining after the processes above, using -mmin +1440 (not modified in 24 hours) and -size +10k
count_old=$(find ${mkv_destination_qnap3}mkv_cooked/ -name "*.mkv.txt" -mmin +1440 -size +10k | grep -c ".mkv.txt")
if [ "${count_old}" -gt 0 ];
 then
   echo "*** Files found indicating stalled or killed process, deleting them to enable reprocessing ***" >> ${mkv_destination_qnap3}post_rawcooked.log	
   
   # Add to log file
   find ${mkv_destination_qnap3}mkv_cooked/ -name "*.mkv.txt" -mmin +1440 -size +10k | rev | cut -c 9- | rev >> ${mkv_destination_qnap3}post_rawcooked.log
   
   # Delete any MKVs that exist alongside these stalled text files (the {}* glob catches both the .mkv and its .mkv.txt)
   find ${mkv_destination_qnap3}mkv_cooked/ -name "*.mkv.txt" -mmin +1440 -size +10k | rev | cut -c 5- | rev | parallel --jobs 10 "rm -f {}*"

   # Delete the stalled txt files themselves (-f as the glob above may already have removed them)
   find ${mkv_destination_qnap3}mkv_cooked/ -name "*.mkv.txt" -mmin +1440 -size +10k | parallel --jobs 10 "rm -f {}"
	
 else
   echo "*** No incomplete files to delete this time ***" >> ${mkv_destination_qnap3}post_rawcooked.log
fi

# Write an END note to the logfile 
echo "========== Post-rawcook workflows ENDED  =================================================== $date_FULL" >> ${mkv_destination_qnap3}post_rawcooked.log

# Update the count of successful cooks at top of the success log  
# First create new temp_success_log with timestamp
echo "==== Updated $date_FULL =====================================" > ${mkv_destination_qnap3}temp_rawcooked_success.log

# Copy all lines containing /mnt/ from the old success log into the new temp log, count them, then append that count
grep "/mnt/" ${mkv_destination_qnap3}rawcooked_success.log >> ${mkv_destination_qnap3}temp_rawcooked_success.log
success_count=$(grep -c "/mnt/" ${mkv_destination_qnap3}temp_rawcooked_success.log)
echo "==== Successful cooks: $success_count ==============================================" >> ${mkv_destination_qnap3}temp_rawcooked_success.log

# Sort the log and remove any non-unique lines
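# (the reverse sort keeps the ==== header lines above the /mnt/ paths)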
sort ${mkv_destination_qnap3}temp_rawcooked_success.log | uniq | sort -r > ${mkv_destination_qnap3}temp_rawcooked_success_unique.log
  
# Move the new log renaming it to overwrite the old log
mv ${mkv_destination_qnap3}temp_rawcooked_success_unique.log ${mkv_destination_qnap3}rawcooked_success.log

#AVpres bash script repositories

Now you have an eye for bash scripting, why not take a look at some of the remarkable collections of scripts available to the archiving community. If you can, make a repository of your own and borrow (or fork) some of these collections into yours, making changes to suit your needs. This kind of collaboration is incredibly useful to other archivists and to the developers of the scripts themselves – particularly if you find new ways to reinvent old scripts and give them new purpose.

The biggest and best #avpres bash repository I know of belongs to City University of New York (CUNY) Television: mediamicroservices by Dave Rice and friends. These include over 50 really useful and instructive scripts, like makelossless which generates a lossless Matroska FFV1 or JPEG 2000 file, viruscheck which uses ClamAV to scan files for viruses, and makemkvchapters which creates ordered chapters in a Matroska file. There are also some amazing Open Archival Information System (OAIS) scripts for making Archival Information Packages (AIPs), and scripts to make derivatives such as you’d use in a Dissemination Information Package (DIP).

The GitHub page includes mediamicroservices documentation with installation guides and full descriptions of all the scripts and what they do. They’re an amazing resource for anyone working with media files in archiving. Again, I highly recommend you take a look at the amazing Microservices in Audiovisual Archives by Annie Schweikert and Dave Rice, which discusses the potential of bash scripts within archives.

Another GitHub repository worth a visit is Puget Sound and Vision’s, which has separate audiotools and videotools repositories. The audiotools include AudioAIP, which ‘creates an archival package that adheres to the bagit standard with a mezzanine file, an access file, technical metadata and checksums.’ Similarly, videotools includes a VideoAIP which does the same but for a video file. The beauty of shell scripts, now you understand them, is that you can edit the FFmpeg settings to suit your own needs. Do make sure you set up a safe test environment before setting scripts loose on larger collections though. If I come across any more I’ll add them here.

With thanks

Many thanks to the BFI for allowing the scripts to be made public. Particular thanks to Stephen McConnachie, Brian Fattorini, Jérôme Martinez, Kieran O’Leary, Dave Rice, Ashley Blewer, Paul Mahol, Reto Kromer, Peter Bubestinger-Steindl, Andrew Sargeant, Lucy Wales, Michael Norman and many more who’ve helped and guided me. Special thanks to all those involved in the No Time To Wait! symposiums, who have been so encouraging. Check out the Zulip group #AVhackers, which has been set up for digital preservation practitioners to discuss anything digital preservation – there’s a bash script feed where you can bring problems or ask questions! Sign up by following the link in my AVhackers blog here. Feedback is very welcome. Thanks for visiting!

Links

Many of these links appear in the text above but I’ll list them all here again to save you searching for them. They’re in no particular order!

Shell script website full of information: https://www.shellscript.sh/
Bash man page online: http://manpages.ubuntu.com/manpages/bionic/man1/bash.1.html
RAWcooked software: https://mediaarea.net/RAWcooked
For an introduction to using RAWcooked take a look at my earlier RAWcooked workflow post here.
MediaConch from MediaArea: https://mediaarea.net/MediaConch
MediaConch online policy maker: https://mediaarea.net/MediaConchOnline/
FFmpeg, so good it always has to have a link: https://ffmpeg.org/
FFmprovisr has FFmpeg bash commands perfect to incorporate into your archival scripts: https://amiaopensource.github.io/ffmprovisr/
Check out all the amazing NTTW Symposium videos on YouTube: https://www.youtube.com/channel/UC-NF6EF-tN0S0FrJUD20-ww/playlists
Join #AVhackers on Zulip, a casual space to ask questions and chat about all things digital preservation. Sign-up details can be found in my blog post #AVhackers launches on Zulip.
Visual Studio Code: https://code.visualstudio.com/
Nano editor for Terminal: https://www.nano-editor.org/
Flock lock: https://ma.ttias.be/prevent-cronjobs-from-overlapping-in-linux/
An amazing syntax checking website to perfect your script: https://shellcheck.net
Configuring crontab: https://www.howtogeek.com/101288/how-to-schedule-tasks-on-linux-an-introduction-to-crontab-files/
Cronic the cron email manager: https://habilis.net/cronic/
Crontab generator: https://crontab-generator.org/
Crontab guru: https://crontab.guru/
Running bash shell scripts on Windows 10: https://www.howtogeek.com/261591/how-to-create-and-run-bash-shell-scripts-on-windows-10/
How to set up email on a virtual machine running Ubuntu:
https://help.ubuntu.com/lts/serverguide/exim4.html
https://help.ubuntu.com/lts/installation-guide/armhf/ch08s05.html
bc precision calculator programme: https://www.gnu.org/software/bc/manual/html_mono/bc.html
Ashley Blewer’s Bash training, one of many useful intro slides: https://training.ashleyblewer.com/
Stackoverflow, answer to all your search engine questions: https://stackoverflow.com/
Microservices in Audiovisual Archives, Dave Rice & Annie Schweikert: http://journal.iasa-web.org/pubs/article/view/70
CUNY mediamicroservices GitHub repository: https://github.com/mediamicroservices/mm
Netdata real-time performance monitoring: https://github.com/netdata/netdata
GNU Parallel: https://www.gnu.org/software/parallel/
The difference between glob and regex: https://www.linuxjournal.com/content/globbing-and-regex-so-similar-so-different

The MIT License (MIT)

Copyright (c) 2020 British Film Institute

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
