Open source scripts from the BFI National Archive

Hello! I wanted to share a few GitHub repositories which have been created in my new role as Collections and Information Developer at the BFI National Archive – and which use open source software from the wonderful Media Area, FFmpeg and more!

The scripts featured are a mix of bash shell scripts and Python, and I’ve tried to explain in README.md files and comment the code as much as possible for anyone new or learning how to code. They’re all available under the MIT Licence so please feel free to fork or experiment with them – but please do so in a safe environment away from any preservation critical files.

The three repositories featured here are just a few of a growing collection available from the BFI Data and Digital Preservation GitHub, like Stephen McConnachie’s BFI Southbank Programme Notes website repository. I hope to prepare more Python scripts to share to open GitHub repositories in coming months, including our Python FFV1 Matroska to V210 MOV transcoding script.

With thanks to Stephen and colleagues at the BFI National Archive for their excellent support and open source ethos. Thanks for stopping by – any questions or feedback are welcome. Don’t be shy!

DPX encoding using RAWcooked

https://github.com/bfidatadigipres/dpx_encoding

The shell scripts for the BFI National Archive’s DPX workflows have had a little overhaul recently, so as a follow-up to my previous blog post Using bash scripts to automate #AVpres workflows I’m sharing the updates and new additions in their own GitHub repository.

In my previous position here at the BFI as Digital Preservation Data Specialist it was my full-time job to manage RAWcooked encoding for a digitisation project. This project’s aim is to encode all 3PB of BFI DPX sequences to FFV1 Matroska video files and ingest them into our Digital Preservation Infrastructure from legacy LTO storage. Now in my new role as C&I Developer I’m seeking to make these processes easier to manage, by refactoring the original scripts to handle encoding difficulties, and by introducing new scripts that enable greater automation of key stages of the workflow.

In addition, Media Area have recently released RAWcooked version 21.01, which brings some helpful new command-line flags making it even easier to safely encode DPX (and TIFF) sequences, so make sure you upgrade your RAWcooked software and check out their manual and help pages for more information.

To find out more about these scripts and how they function please take a look at the repository’s README.md, which gives a detailed description of the script functions, the operational environments required, crontab configuration, software dependencies and information about environmental variables. These scripts are still in development and testing, so please watch out for the odd problem I’ve not resolved yet!

Title article splitting scripts

https://github.com/bfidatadigipres/title_article_split

During the creation of several scripts that write film and television titles to BFI National Archive collections database records, a title article splitting recipe has evolved into the scripts found here. These scripts allow for language ISO codes (alpha-2) to be paired with roman alphabet titles, so that definite and indefinite articles can be identified and separated from the whole titles.

The scripts here include title_article.py, which I import into many other scripts, calling the title_article.splitter(“The Red Shoes”, “en”) function multiple times. On each call it looks up all available dictionary entries of articles for the supplied language, string matches them against the title, separates any article present and returns the parts, e.g. article “The” and “Red Shoes”. It also handles articles that are joined to the title, as in French, Arabic and Italian: for example “L’atlante” returns “L'” and “Atlante”.
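To illustrate the behaviour described above, here is a minimal sketch of how a splitter() function like this might work. The ARTICLES dictionary and the logic below are my own illustrative assumptions, not the BFI’s actual implementation – the real script supports many more languages and edge cases.

```python
# Illustrative sketch of an article-splitting function.
# ARTICLES maps ISO alpha-2 language codes to known articles;
# entries ending in an apostrophe are "joined" articles (no space).
ARTICLES = {
    "en": ["The", "A", "An"],
    "fr": ["Le", "La", "Les", "Un", "Une", "L'"],
    "it": ["Il", "Lo", "La", "I", "Gli", "Le", "L'"],
}

def splitter(title, language):
    """Return (article, remainder) for a title, or ("", title) if none found."""
    for article in ARTICLES.get(language, []):
        if article.endswith("'") and title.lower().startswith(article.lower()):
            # Joined articles (e.g. "L'atlante") have no space after them
            return article, title[len(article):].capitalize()
        if title.lower().startswith(article.lower() + " "):
            return article, title[len(article) + 1:]
    return "", title
```

Called as splitter("L'atlante", "fr") this sketch returns ("L'", "Atlante"), matching the example above.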

For a quick test you can try title_article_input.py, which allows you to test individual titles from the command line. It works in exactly the same way, but main() passes the command-line title and language inputs to the splitter function. You can run this script with a single command:
python3 title_article_input.py "L'atlante" "fr"

This repository also includes a Pytest test module for the title_article.py script. It tests the splitter() function against a number of regular and ridiculous title inputs to ensure that all potential outputs are formatted correctly, and that errors are handled successfully. To use this test script you will need to install Pytest, and all the details needed to use it can be found in the README.md.
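As a rough idea of the style such a Pytest module takes, here is a self-contained sketch. The real tests import splitter() from title_article.py; the toy splitter below (which only handles the English “The”) stands in so the sketch runs on its own, and the test cases shown are illustrative.

```python
# Illustrative Pytest module; run with: pytest <this_file>.py
import pytest

def splitter(title, language):
    # Stand-in for title_article.splitter(); handles only English "The"
    if language == "en" and title.startswith("The "):
        return "The", title[4:]
    return "", title

@pytest.mark.parametrize(
    "title, language, expected",
    [
        ("The Red Shoes", "en", ("The", "Red Shoes")),
        ("Citizen Kane", "en", ("", "Citizen Kane")),
    ],
)
def test_splitter(title, language, expected):
    assert splitter(title, language) == expected
```

The parametrize decorator lets one test function cover many title/language pairs, which suits this kind of dictionary-driven string matching well.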

It was really great to work with the BFI Information team to create these scripts; they were instrumental in defining the international dictionary of articles. There are still lots of languages and variations to incorporate, though, so I hope that this script can develop further over time and with help from international friends! But for the time being it’s doing a splendid job tidying up titles and moving them into the BFI database.

BFI Checksum scripts

https://github.com/bfidatadigipres/checksum_scripts

In recent months we’ve been running checksum speed comparisons, with the aim of reducing bottlenecks caused by an increasing volume of digital media files. One such bottleneck was caused by our use of hashlib in Python2 scripts to generate MD5 checksums for every media file before it is written to LTO tape storage. We recently ran some comparisons between CRC32 and MD5 as the most likely fastest options supported in our data tape library system (it supports cryptographically secure hash types, but we were aiming for speed).

The scripts in this repository use the Python standard library modules zlib and hashlib to generate CRC32 and MD5 checksums. They both use timeit to measure the time it takes to run each checksum pass. This repository has two versions of the checksum_speed_test script, allowing for single-use checksum testing or automated testing of directories, and both will run on Python 2.7 or Python 3+.
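The core of that comparison can be sketched in a few lines. This is a simplified stand-in for the repository’s scripts – the dummy in-memory data and repeat count are my own placeholders, where the real scripts read media files from disk.

```python
# Simplified CRC32 vs MD5 speed comparison using only the standard library.
import hashlib
import timeit
import zlib

data = b"\x00" * (10 * 1024 * 1024)  # 10 MB of dummy "media" data

def crc32_checksum(chunk=data):
    # zlib.crc32 returns an int; mask and format it as 8 hex characters
    return format(zlib.crc32(chunk) & 0xFFFFFFFF, "08x")

def md5_checksum(chunk=data):
    return hashlib.md5(chunk).hexdigest()

# timeit runs each callable repeatedly and returns total elapsed seconds
crc_time = timeit.timeit(crc32_checksum, number=5)
md5_time = timeit.timeit(md5_checksum, number=5)
print(f"CRC32: {crc_time:.3f}s  MD5: {md5_time:.3f}s")
```

On most hardware CRC32 comes out faster, but as a non-cryptographic check it only guards against accidental corruption, which is the trade-off weighed in the tests above.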

Thanks to the BFI National Archive Collections Systems Manager Rob Scott for his assistance publishing the results of the tests, which we analysed over a week. I’m delighted to say we’ve now moved our checksum generation to separate Python 3 MD5 generation scripts distributed across multiple servers, and this bottleneck has been eradicated!
