DPX preservation workflow with RAWcooked and fixity checking

Since acquiring a Golden Eye 4 (GE4) scanner here at the Media Archive for Central England (MACE) we’ve developed a DPX preservation workflow which I’ve decided to share here, in the hope of feedback and collaboration.  Our first DPX scans would be wrapped as uncompressed sequences using 7-Zip in a TAR archive and written to standalone LTO decks (read more about MACE’s LTO workflow here). This workflow changed recently following a period of testing of MediaArea’s RAWcooked software, making DPX compression into an FFV1 Matroska (.mkv) audiovisual file possible.  It is this new workflow I will share here in the following sections:

  • So what is RAWcooked?
  • Using command line
  • Tools and installation
  • Our DPX preservation workflow
    – Directory structures
    – Automate or manual workflow?
    – RAWcooking and demuxing
    – File fixity checks
    – MD5 checksum generator
    – Copying to LTO using Python copyit.py
    – Automating copyit.py [UPDATED NOV 2019]
  • A few final snags

So what is RAWcooked?

RAWcooked is an amazing tool which takes raw audio-visual image sequences and encodes them into a lossless video stream, reducing the overall storage size by between one and two thirds.  FFmpeg is used to encode the audiovisual data into a Matroska container using the video codec FFV1, and audio codec FLAC.  The FFV1 codec rather brilliantly compresses each frame by predicting pixel values from their neighbours and storing only the difference between the prediction and the real value, so that the file can be restored perfectly at a later date.

The metadata accompanying the raw DPX sequence is fully preserved, along with additional sidecar files such as MD5 checksums, LUTs or XMLs if provided. The lossless Matroska video stream can be played by VLC or MPV media players, and writing and retrieving from storage devices such as LTO is significantly quicker. If you need to use the RAW source in its original form, one line of code will easily restore it bit for bit, faster than retrieving the same file from LTO tape storage. Most importantly for us, the average size reduction from MACE’s GE4 scanner is approximately 50%, a significant saving when handling multiple 16mm reels of 400ft 4K scans at around 700-750GB uncompressed.  This soon fills an LTO6 tape with just 2.4TB capacity, so reducing it by half without any loss of data is an amazing gift to our MACE archive.
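As a rough worked example of that saving, the arithmetic below estimates how many reels fit on one LTO6 tape before and after RAWcooking. The 725GB reel size is an assumption (the midpoint of the 700-750GB range quoted above), the function name is hypothetical, and real tapes lose a little space to indexing, so treat the figures as approximate.

```python
def reels_per_tape(tape_gb, reel_gb, reduction=0.0):
    """Whole reels that fit on a tape after a fractional size reduction."""
    stored_gb = reel_gb * (1.0 - reduction)
    return int(tape_gb // stored_gb)


TAPE_GB = 2400   # LTO6 capacity quoted above (2.4TB)
REEL_GB = 725    # assumed midpoint of a 700-750GB 4K scan

print(reels_per_tape(TAPE_GB, REEL_GB))        # uncompressed DPX
print(reels_per_tape(TAPE_GB, REEL_GB, 0.5))   # FFV1 at roughly 50% reduction
```

With these assumed figures that is three reels per tape uncompressed, against six once the scans are RAWcooked.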

I started beta testing with RAWcooked over a year ago generating some fairly simple software breaks for the developer Jérôme Martinez, and raising Issues which you can view on their GitHub. In addition to just testing the software I also wrapped/demuxed a couple of sequences, replacing the original DPX sequence into an existing Da Vinci Resolve project without any problems, and exported a graded H264 MOV using the demuxed scan. You may think this test phase isn’t necessary but I would certainly recommend you engage with these processes. I still have lots to learn about this software but testing and participating in GitHub conversations is a great start.

For a more thorough explanation of how FFV1 compresses DPX data I recommend you read Introduction to FFV1 and Matroska for film scans by Kieran O’Leary (image right from blog). In addition the blog explains how FFV1 generates CRC32 checksums for every slice of a frame, and how to perform a fixity check of the FFV1 file once you’ve RAWcooked your file to ensure there are no faults with the encode.

License Key
To operate the software fully you will need to request a license key and the first time you run the software you’ll need to specify the key by typing:

rawcooked --store-license XXXXXXXXXXXXXXXX

The license information will be stored in a RAWcooked config file, and accessed each time you run the software. You can recall this license data at any time, including a list of the codecs your license supports, by typing:

rawcooked --show-license

Free flavours
If your archive has a tight budget then the free RAWcooked default flavours might considerably influence your scan setting choices. We’re lucky at MACE that our own scan choice is aligned to these free flavours.  Right now the default supported input flavours for RAWcooked include DPX 8 and 10-bit RGB, and audio supported is PCM 16-bit 2 channel LE in a WAV, BWF, RF64 and AIFF wrapper.  Supported output flavours include FFV1 version 3 and FLAC audio in a Matroska wrapper.

During the set up phase our test scans were captured in RGB 12-bit Packed LE which isn’t currently available as an option in RAWcooked, so these files can’t be wrapped at this time. I have an issue raised on the GitHub which RAWcooked developer Jérôme Martinez has assigned himself to work on in coming months, making this new flavour available.  It’s possible MACE will have a chance to contribute to RAWcooked financially in the future and that will allow us to wrap a few of these older assets.  Thankfully, following advice from Golden Eye we made the choice to switch to RGB 10-bit (before we were fully aware of the RAWcooked bonus) and this has worked out very well for us.  It shouldn’t be the only reason you consider scanning 8 or 10-bit RGB, but the free supported flavours and the great storage advantages of using RAWcooked make it a very appealing choice for a financially challenged institution.

For additional context about RAWcooked here’s MediaArea’s Jérôme Martinez speaking at FOSDEM last year, discussing how RAWcooked works. You might also want to check out the MediaArea.net events  for the latest slides from his recent talks.

Using command line

It’s possible you’ve come to this blog without any knowledge of working with a command line interface (CLI), such as MacOS’s Terminal and Windows’ Command Prompt. A CLI is a text-only window in which you type instructions directly to the computer, in contrast to a graphical user interface (GUI): software such as Adobe Premiere uses a complex GUI that allows you to edit and export video by pushing buttons. Each CLI has a programme called a shell, or command interpreter, running within it which reads the commands and then executes them. For MacOS Terminal the shell is called BASH, or Bourne Again SHell, and can be seen at the top of the Terminal window when there is no command running as “~ — -bash”.  When issuing a command into bash you’re telling a computer to do something in its own language. For example the “mkdir” command means make directory (folder is another name for a directory).

Command line enables you to install and use open source tools that do not have GUIs, such as FFmpeg and RAWcooked. You can easily automate commands, so the computer can do some of your work for you using for loops or recursive loops. If accessing computers remotely you often have no access to a GUI, so you have to know command line to talk ‘in the dark’.  It is easier to drag and drop files or directories into the command line stream and this will ensure your path is correctly formed. You can type it too, but drag and drop saves loads of time. If you don’t set a path for an output file then Terminal always saves it to the location that the command line is in already, which you can discover by calling Print Working Directory like this, and pressing return:

pwd

To change from this directory just type ‘cd’ and drag and drop the new location.

cd /path_to_directory/

Code is really pedantic: it has to be exact or it won’t be happy, so you swiftly learn how to spot errors and omissions in your typing. Most typos are harmless, because a command with the smallest character wrong will usually just fail to run. Don’t rely on that entirely though. Some commands, such as Windows’ copy, prompt you before overwriting a file with the same name, but others, such as cp and mv on MacOS and Linux, overwrite without asking unless you add the -i option, so check your paths before pressing return. Watch out particularly for mistakes with spaces (missing them or adding too many), the number 1 and the lower case letter L, or single quote marks ‘ being mistaken for the ` grave accent.

Gradually over weeks and months of successful experimenting you will wish that you had engaged with command line earlier. I can’t wait to launch Terminal when I get to work and I try to find jobs that require my engagement with command line so I can keep developing my skill set.  For more training look at this great CLI presentation by Ashley Blewer which has loads of other commands to get you started.

Once you have a few CLI basics under your belt, I recommend you visit AMIA Open Source’s FFmprovisr, a repository of FFmpeg commands for archivists which you can contribute to using their GitHub page. It provides nice insight into some of the features of FFmpeg, describing Filtergraphs and Default codec settings etc.  Knowledge of FFmpeg will help with your RAWcooked use, as the encoding process uses FFmpeg.

Tools and installation links

The easiest way to install open source software from GitHub on your MacOS or Windows system is with a package manager: Homebrew for MacOS, or Chocolatey for Windows.

Package managers automate the process of installing, upgrading, configuring, and removing computer programs for a computer’s operating system in a consistent manner. They typically maintain a database of software dependencies and version information to prevent software mismatches and missing prerequisites. When I’ve installed Homebrew to MacOS before it’s required an additional install of Xcode, which Homebrew installs itself but does require you to say ‘yes’ to. I’m unsure of requirements for Chocolatey as I’ve never installed this way to a PC, but I’ve read it’s a similar process to Homebrew.

Once the package manager is installed you can access full operating details for Homebrew (or Chocolatey by changing command to ‘choco’) by running:

man brew to run the manual, press q to exit
brew -h  to view the help pages

Homebrew / FFmpeg and other open source tools
There’s a little bit of a workload increase when committing to open source workflows such as RAWcooked, but it’s worth it for access to such amazing archival tools. There’s some crossover here with my blog FFV1 video capture workflow so I recommend you read the Homebrew and FFmpeg installation sections, which have all the links you need to set these up. You might want to read the general guide to using open source software and project development on GitHub too. It’s not a definitive guide (check out Ashley Blewer’s training pages for that), but I hope some of my own discoveries will be of help to you. You might also want to download MPV player and ShotCut open source video editor, as these are really useful free tools that will help with asset assessment and pre and post-compression editing.

RAWcooked
As with the other open source softwares, RAWcooked needs FFmpeg installing beforehand.  You can access full installation information at MediaArea website, where there’s a ‘Download RAWcooked’ pop down with every operating system you’ll probably ever need to install to.  I’ve only installed it to Windows and MacOS so far, and both times it’s been a really simple process.  In the past I’ve run the software from its own directory on Windows, but in recent months I’ve edited the Windows PATH Environment Variable (read how to here) to be able to call the software in any directory. This is a necessary step if you ever want to use RAWcooked with Python scripts such as the IFIscripts seq2ffv1.py.  You can also install RAWcooked with Homebrew package manager if you’re on a Mac by typing the code below, which will automatically organise the PATH Environment Variable for you.  [Note: The Windows package manager Chocolatey doesn’t have RAWcooked as a download option yet, but Jérôme tells me that it is on its way!]

brew install rawcooked

The current software 18.10.1 will work without a paid license as long as you are using the default free flavours. If you want to compress non-default flavours then you will need to request a license. You can also ask for a test license, skip ahead to purchasing a full license from MediaArea.net, or commission a development of a flavour that’s not already supported.  Once installed make sure you check for upgrades fairly regularly. This is easy with a Homebrew install, just type the following command:

brew upgrade rawcooked

To follow the project changes more closely, and perhaps contribute to its development follow the GitHub page here.  It has plenty of background information about the project, and is bubbling with ideas about future developments, which I encourage you to participate in by signing up to GitHub!

MD5tool.py
To make MD5 manifests I have traditionally used a Python script by GitHub user California Revealed, called MD5tool.py. This script uses the Python module hashlib to generate a manifest which is stored alongside each file in a given directory.  To download it visit the MD5tool GitHub page and follow the install instructions on the page’s ReadMe. I downloaded and unzipped the software to my User directory, from where I operate the software using Terminal / Command Prompt.

Copyit.py and Seq2ffv1.py
Please take a look at installation instructions from my DIY Python microservices post which features an introduction to these and other IFIscripts, including installation guides and links to the official documentation. It’s worth knowing more about this amazing multi-award winning collection of scripts by Kieran O’Leary.

Our DPX preservation workflow

Once the GE4 has scanned a film it’s given a unique number from MACE’s database and placed in a shared directory called ‘Captures’. From here the file is loaded into the grading suite’s HP Z840 and using Da Vinci Resolve the film is added to a full HD timeline, resized, speed adjusted, colour graded, combined with equalized audio and an H264 intraframe video is exported for mezzanine storage and client supply.  Following the completion of this the DPX scan is packaged in preparation for copying to LTO, and this begins our DPX preservation workflow.

MACE directory structuring
I thought it might be helpful to give a little overview of the way we layout the directories for our scan data.  The Scans are moved to a new directory called ‘DPX for Processing’, where they are shaped into a loose Archival Information Package (or AIP, read more about OAIS microservices here).

It starts with a top directory, with the same unique number as the DPX scan allocated from the MACE database, eg 23456.  Inside are two more directories: the first is ‘Project’ which contains the exported files from Da Vinci Resolve (a FCPXML, XML, EDL, colour EDL and the Resolve project file); the second is named after the scanned film, eg ‘[ATV Today. 01.01.1970. News Today] Reel 01 of 01’, and inside this is another directory with the scan dimensions, such as ‘2048 x 1536’.  Inside this final directory the DPX scans are relocated and named ‘Scan01’, ‘Scan02’ depending on the quantity of captures made for each item. It’s these DPX scan directories that I use for the RAWcooked preservation wrapping.

I don’t always copy the DPX scans directly into their preservation directory in this way. On a few occasions I’ve only edited and graded a short clip from a lengthy 2K or 4K scan and haven’t wanted to preserve the entire DPX sequence. This is mostly when a recapture is taken for bouncy frame lines or for a short section of colour faded film. In this instance I would use Da Vinci Resolve’s Media Management function to trim and move only the section used on the timeline, leaving the unused DPXs in the original ‘Captures’ directory. This method can interrupt the numbering sequence for your DPX scan, but this won’t interfere with RAWcooking.

If the film has sound, then I will place an exported file in this top directory as a .wav, alongside the ‘Project’ directory. At the moment we only keep an exported wav file from the edited and equalised timeline. This makes RAWcooking more straightforward as it’s just the DPX to FFV1 process we require, no FLAC encoding.  Because of this, I have very little experience of wrapping DPX and audio files within a RAWcooked Matroska file. I remember testing with some integrated audio in the early test stages and can’t recall any difficulties so I trust you will find it the same.

This prep may all seem like a faff but the purpose is to speed up recovery access should the worst happen and MACE loses any mezzanine level files. We have already had one Drobo fail here at MACE, thankfully the back up is okay. Any project we remove from LTO tape will have everything in the package you need to rebuild it quickly and easily for speedy file regeneration.  The grade, audio edit and DPX scans are all neatly organised and in locations linkable to the enclosed project files.  I guess the only problem will be Resolve’s inevitable software obsolescence, but I hope the additional EDLs, XMLS, CDLs etc will provide options for quick project rebuilds. And there will probably be an awesome open source alternative available in years to come.

Automate or manual workflow?
Because RAWcooked is a relatively new piece of software many users I know are still checksum validating their cooked files, as we do here at MACE. It may be that this always remains the case for archival storage, but in an automated sense. What this means is RAWcooking the Scan01 directory, demuxing it again, generating md5 checksums of both pre and post-cooked directories and running a comparison of the two md5s to ensure they are exactly the same.  I also run a CRC-64 checksum of each of the pre and post-cooked directories to ensure everything else is correct including directory name etc. There are checksum processes built into RAWcooked that do this as well, but it would be hard to judge if these have failed from a user’s perspective as the software is deliberately kept user friendly with minimal CLI interaction.  As an extra failsafe it’s nice to know your files are exact copies after rebuilding them, giving you greater confidence to delete that Scan01 directory after you’ve RAWcooked it.
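The comparison of pre and post-cooked directories described above can be sketched in Python with the standard hashlib module. This is a minimal illustration and not part of RAWcooked or of any script mentioned in this post; the function names are hypothetical, and it assumes the original and demuxed directories have the same internal layout.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_manifest(directory):
    """Map each file's path (relative to directory) to its MD5."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_directories(original, demuxed):
    """Return a sorted list of relative paths that are missing or differ."""
    before = md5_manifest(original)
    after = md5_manifest(demuxed)
    return sorted(
        name for name in set(before) | set(after)
        if before.get(name) != after.get(name)
    )
```

An empty list from compare_directories means every file is present in both directories with an identical checksum; any path it returns needs investigating before the original scan is deleted.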

The quickest method with this kind of workflow is to use an automated script such as the python scripts belonging to IFI Irish Film Archive, like seq2ffv1.py.  It uses RAWcooked to package a DPX (or TIFF) sequence and also exports a series of valuable sidecar files including checksum files, FFmpeg encode logs and metadata files and packages them all into an IFI Irish Film Archive specific Archival Information Package (AIP).  This export structure is very IFI centric, and may need adjusting after each conversion to suit your own needs – or if you have Python skills you can try editing the script. I’ve written about this script in more detail in my DIY Microservices post so if you fancy the challenge take a look at automating your workflow. You can read more about seq2ffv1.py and all the IFIscripts here in the official documentation.

IMG_0134

For the remainder of this post I will focus on the manual approach, which will probably be easier to help you start experimenting and testing with RAWcooked right away.  Our manual RAWcooked workflow is fairly straightforward, though time consuming. As the image above shows I like to have multiple windows open to increase efficiency. It can look a bit intense but each Windows Command Prompt is working on one repeat task alongside the relevant DPX directory so that you don’t lose track of the processes. Following are commands that will work on Linux, MacOS and Windows as RAWcooked, FFmpeg and Python are all standardised across operating platforms.

RAWcooking and Demuxing
If you’ve installed FFmpeg and RAWcooked, and edited the PATH environment variable, then the first step is to take your DPX scan directory and run a RAWcooked wrap on it by typing the really simple line of code below. If you haven’t edited the PATH environment variable you can still run RAWcooked from the directory it sits in, by using the ‘cd’ command to navigate into the directory that contains the RAWcooked app. Find out more about editing paths for command line MacOS here and Windows here.

rawcooked /path_to_directory/Scan01

The wrapping process will begin, the first sign of this being the appearance of a RAWcooked reversibility data file alongside your DPX scan directory. This disappears when the process completes and you’ll be left with your DPX scan directory, and an FFV1 Matroska file probably around half its size.  If there’s an audio track in this stream it will be incorporated into the finished Matroska wrapper as a FLAC encoded track, with the option of switching this to PCM (see the help page for correct command instructions, and you’ll need an additional license to export to PCM).  After I’ve RAWcooked a directory and the Matroska is sat next to the DPX scan directory I immediately begin a demux process, by running RAWcooked on the Scan01.mkv file like this:

rawcooked /path_to_directory/Scan01.mkv

This will generate a second directory alongside the original Scan01 directory called Scan01.mkv.RAWcooked, and inside this will be a Scan01 directory that is exactly the same as your original (see image below to illustrate).

My illustration above of RAWcooking is the most basic operational level.  There are so many other options available to you during wrapping/demuxing, all of which are detailed in the man page (operator manual) and via the RAWcooked help. These include forcing a framerate, running partial or full checks, changing the output name of the file, changing the audio output from FLAC to PCM etc. To view these in Terminal or Command Prompt type:

rawcooked -h   runs help
man rawcooked   loads the manual. Type q to exit the man programme.
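The basic encode and demux round trip described earlier in this section can also be driven from Python. The sketch below only uses the two plain rawcooked calls shown in this post; the function names are hypothetical, and the expected output locations are taken from the behaviour described here (a Scan01.mkv beside the scan directory, then a Scan01.mkv.RAWcooked directory containing Scan01), so check them against your own RAWcooked version.

```python
import subprocess
from pathlib import Path


def rawcooked_outputs(scan_dir):
    """Return the paths RAWcooked is expected to create for a scan directory."""
    scan_dir = Path(scan_dir)
    mkv = scan_dir.with_name(scan_dir.name + ".mkv")
    demuxed = mkv.with_name(mkv.name + ".RAWcooked") / scan_dir.name
    return mkv, demuxed


def cook_and_demux(scan_dir):
    """Encode a DPX directory to Matroska, then decode it again."""
    mkv, demuxed = rawcooked_outputs(scan_dir)
    subprocess.run(["rawcooked", str(scan_dir)], check=True)   # DPX -> MKV
    subprocess.run(["rawcooked", str(mkv)], check=True)        # MKV -> DPX
    return mkv, demuxed
```

Calling cook_and_demux("/path_to_directory/Scan01") needs rawcooked on your PATH, and check=True stops the script if either step reports a failure.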

Two Scan directories, Scan01 and Scan03 having been RAWcooked and demuxed again, now undergoing fixity checks

File fixity checks
There are two stages of MD5 checksum generation in MACE’s workflow. The first stage validates the RAWcooking process, checking demuxed directories are identical to the original DPX scan directories. The second, ‘MD5 checksum generator’ below, generates MD5 checksums of all the directory contents to ensure copying integrity between the Z840 computer and the LTO Tapes.

The first of my RAWcooked validation checks runs a CRC-64 comparison using Windows software 7-Zip’s CRC SHA. The image above illustrates this process rather badly, as you should run it on the directory INSIDE the one I’ve selected, which will have the path Scan03.mkv.RAWcooked/Scan03 and contain the demuxed DPX files.  This way the two comparisons have the same directory names and contents and the CRC won’t fail.  I manually compare the CRC outputs of this process which takes a couple of seconds, and a straight edge helps!

Framemd5 from a DPX directory

I also run an FFmpeg framemd5 command on the original DPX scan directory and the demuxed directory.  I use the first DPX file in the sequence to run the FFmpeg command, and replace the frame number with a printf-style pattern, ‘%06d’, which stands for a six digit zero-padded number in place of 000000. If your scans have 5 or 7 digits in the DPX file name then adjust the pattern to ‘%05d’ or ‘%07d’. Also, if the DPX sequence you are compressing does not start at 000000, which is a common occurrence here at MACE, then you need to add a ‘-start_number 001234’ option before the ‘-i’, giving the first number in the sequence, but leave the pattern in place. The following command uses FFmpeg’s framemd5 output format to generate an MD5 of each DPX in the sequence and compiles them into a .txt file with header information at the top (see image above).

ffmpeg -i /Volumes/path_to_directory/Scan01/DPX23456_%06d.dpx -f framemd5 md5_23456_Scan01.txt

And the command again with a dpx sequence that starts at 001234:

ffmpeg -start_number 001234 -i /Volumes/path_to_directory/Scan01/DPX23456_%06d.dpx -f framemd5 md5_23456_Scan01.txt
[Note: If your sequence has any breaks FFmpeg may fail. When this has been the case I’ve opted to add -pattern_type glob before the -i, and use "*.dpx" in place of the numbered pattern. This reads the DPX files in file name order and lets you skip missing items in the sequence, and you won’t need the -start_number option either.]
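Working out the digit count, the start number and any breaks by eye gets tedious on long sequences, so here is a hedged Python sketch that inspects a DPX directory and builds the framemd5 command for you. The function names are hypothetical and it assumes file names that end in a run of digits followed by .dpx, as in the examples above; it only builds the command and leaves running it to you.

```python
import re
from pathlib import Path


def describe_sequence(directory):
    """Work out the FFmpeg pattern, start number and gaps for a DPX sequence."""
    matches = []
    for path in Path(directory).glob("*.dpx"):
        found = re.fullmatch(r"(.*?)(\d+)\.dpx", path.name)
        if found:
            matches.append(found)
    if not matches:
        raise ValueError("no numbered .dpx files found in %s" % directory)
    matches.sort(key=lambda m: int(m.group(2)))
    prefix, first_digits = matches[0].group(1), matches[0].group(2)
    numbers = [int(m.group(2)) for m in matches]
    # e.g. prefix "DPX23456_" and six digits gives "DPX23456_%06d.dpx"
    pattern = "%s%%0%dd.dpx" % (prefix, len(first_digits))
    expected = set(range(numbers[0], numbers[-1] + 1))
    missing = sorted(expected - set(numbers))
    return pattern, numbers[0], missing


def framemd5_command(directory, output_txt):
    """Build the ffmpeg framemd5 command, refusing sequences with gaps."""
    pattern, start, missing = describe_sequence(directory)
    if missing:
        raise ValueError("sequence has gaps at frames: %s" % missing)
    return [
        "ffmpeg", "-start_number", str(start),
        "-i", str(Path(directory) / pattern),
        "-f", "framemd5", output_txt,
    ]
```

The returned list can be passed straight to subprocess.run. If describe_sequence reports missing frames, that is the cue to fall back to the glob approach in the note above.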

This text file will get dumped where your Terminal working directory is located, so if you’re unsure run the ‘pwd’ command once the .txt file has finished exporting.  I run this command again on the demuxed DPX directory, adding a ‘B’ after the Scan01 in the title of the .txt file to indicate it’s the RAWcooked demuxed version. Next I take these two .txt files and run a file comparison (Windows):

fc md5_23456_Scan01.txt md5_23456_Scan01B.txt

Or a difference check (MacOS):

diff md5_23456_Scan01.txt md5_23456_Scan01B.txt

If everything is okay and the files are identical then you won’t receive any output using diff, and fc gives you a short sentence saying there are no differences. The silence from diff made me nervous for a long time, which is why I instigated the CRC-64 checksum verification as well. However, I’ve experienced enough fails now to know that when it returns nothing it’s really okay!  When I use this comparison for my v210 MOV to FFV1 Matroska conversions I often get this short diff return:

9c9
< #sar 0: 128/117
---
> #sar 0: 16/15

This tells me that there’s a difference on line 9, which is caused by my forcing the Display Aspect Ratio (DAR) from 5:4 (PAR 1.094) to 4:3 (PAR 1.067), resulting in an adjustment to the Sample Aspect Ratio (SAR, FFmpeg’s name for the pixel aspect ratio) shown above. As this is all it returns I know that the framemd5 is fine for the rest of the document.  If a diff command on Mac raises a few issues you can run a byte by byte comparison, which reports the byte and line number of the first difference it finds:

cmp md5_23456_Scan01.txt md5_23456_Scan01B.txt
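Because a harmless header change like the sar line can hide among real problems, it can help to compare the header and the frame checksums separately. The sketch below is one possible way to do that in Python; the function names are hypothetical, and it assumes the usual framemd5 layout in which header lines start with # and every other line describes one frame.

```python
def split_framemd5(path):
    """Return (header_lines, frame_lines) from a framemd5 text file."""
    header, frames = [], []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            (header if line.startswith("#") else frames).append(line)
    return header, frames


def compare_framemd5(path_a, path_b):
    """Report header differences and the indexes of frames whose lines differ."""
    header_a, frames_a = split_framemd5(path_a)
    header_b, frames_b = split_framemd5(path_b)
    header_diffs = sorted(set(header_a) ^ set(header_b))
    frame_diffs = [
        index for index, (a, b) in enumerate(zip(frames_a, frames_b))
        if a != b
    ]
    same_length = len(frames_a) == len(frames_b)
    return header_diffs, frame_diffs, same_length
```

A result with an empty frame_diffs list and same_length set to True means every frame checksum matches, whatever the header differences turn out to be.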

If all comes back as okay I store the framemd5 for the original DPX Scan01 alongside the Matroska for future comparisons, before I delete both the original Scan01 directory and the demuxed Scan01 directory.  If at any point you get a checksum fail, or something doesn’t seem right with the conversion to and from FFV1, you can perform a CRC check on the FFV1 video stream by typing:

ffmpeg -i /path_to_directory/Scan01.mkv -f null -

This fixity checks the codec to ensure there are no errors with it, a really nice function of the FFV1 codec. You can read more about this at FFmprovisr Check FFV1 Fixity.
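If you want to run this check from a script, a minimal Python wrapper might look like the sketch below. The function names are hypothetical. It adds FFmpeg’s standard -v error option so that only error messages are printed, and it assumes that any line FFmpeg prints at that log level is worth a closer look.

```python
import subprocess


def ffv1_check_command(mkv_path):
    """Build the decode-only FFmpeg command, logging errors only."""
    return ["ffmpeg", "-v", "error", "-i", str(mkv_path), "-f", "null", "-"]


def ffv1_errors(mkv_path):
    """Decode the file and return any error lines FFmpeg printed."""
    result = subprocess.run(
        ffv1_check_command(mkv_path),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return [line for line in result.stderr.splitlines() if line.strip()]
```

An empty list from ffv1_errors("/path_to_directory/Scan01.mkv") suggests the file decoded cleanly; it needs ffmpeg on your PATH to run.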

MD5tool.py checksum generator
I’ve used md5tool.py for some time now, ever since we were TAR wrapping DPX directories and writing them to LTO. The script MD5tool.py has two main commands, a ‘generate’ and a ‘check’ MD5.  The script requires a directory input and it will generate (or check) MD5s of all of the files within.  The image below shows md5 files generated by md5tool.py.

Screen Shot 2019-11-08 at 11.58.44

This is really helpful when removing files from LTO in the future: you can copy from tape to your desktop and run the check function on the contents to ensure your copies are accurate, and if you want you can run the check command on the files once they’re sat on LTO.  We’ve never run this check on LTO, as in the early days of using LTO we were advised not to checksum files on tape, to avoid tape wear. We no longer worry about this since learning more from the LTO archival community, and now use Python scripts to copy to LTO which also generate md5 checksums from the LTO tape (more later).

To run the script I ‘cd’ into the md5tool-master directory that houses the md5tool.py script.  Once Terminal/Command Prompt is in the directory I type the following command:

python md5tool.py generate /path_to_directory/23456

This simple python script works equally well with both Python 2 and 3, so I don’t need to be explicit about which Python call to make on my MacOS – which has both installed. The MD5s generated will sit next to their originals (as seen above) with the same title and an .md5 added to the end.  This is the last stage before I copy that directory to LTO tape. Once on LTO you could run an MD5 check by typing:

python md5tool.py check /Volumes/LTO_Tape/23456
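To show the generate and check idea in miniature, here is a hedged Python sketch using hashlib. It is not md5tool.py’s code: the function names are hypothetical, and the sidecar contents (just the hex digest on one line) are an assumption, so files written by this sketch and by md5tool.py may not be interchangeable.

```python
import hashlib
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file, read in 1MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def generate_sidecars(directory):
    """Write <name>.md5 beside every file that is not itself a sidecar."""
    written = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix != ".md5":
            sidecar = path.with_name(path.name + ".md5")
            sidecar.write_text(file_md5(path) + "\n")
            written.append(sidecar)
    return written


def check_sidecars(directory):
    """Return files whose current MD5 no longer matches their sidecar."""
    failed = []
    for sidecar in sorted(Path(directory).rglob("*.md5")):
        target = sidecar.with_name(sidecar.name[:-len(".md5")])
        recorded = sidecar.read_text().strip()
        if not target.is_file() or file_md5(target) != recorded:
            failed.append(target)
    return failed
```

Running generate_sidecars before copying and check_sidecars on the copy mirrors the generate and check commands above; an empty list from the check means nothing has changed.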

I’ve been wanting to review and update this process for a while now, as a script like seq2ffv1.py generates so many excellent sidecar files including MD5 checksums — while also RAWcooking your directory — that it should be an unnecessary step in the future. But as a basic checksum fixity I’ve found it really user friendly and easy to implement.

Copying to LTO using Python copyit.py
The final step in this process is to copy the finished DPX directory — containing the RAWcooked MKV(s), WAV file(s), project files and all associated MD5 checksums — to your LTO archive tape.  In recent months this copying has changed from a simple drag/drop to the tape using LTFS GUI software (read more about MACE’s early LTO/LTFS practices here), to using the IFI Irish Film Archive’s copyit.py Python script. This script scans the DPX directory, assesses available space at the destination LTO, generates an MD5 manifest for the DPX directory, begins copying to LTO, generates an MD5 manifest of the files once on the LTO and then runs a validation check of the two manifests and reports copy success via a Terminal/Command Prompt output.

To run this script I ‘cd’ into the directory holding it, called ifiscripts which I keep in my /users directory. Then I type the following line to start the copy:

python3 copyit.py -l /path_to_directory/23456 /Volumes/LTO_Tape

The script is optimised for Python 3, hence the python3 call, which is only necessary if that is how Python 3 is installed on your system.  The ‘-l’ addition signals that I’d like the copyit script to use gcp instead of rsync for copying to LTO, and is suggested for MacOS copying.  gcp is quicker when writing to LTO and this amendment helps speed up long copying days.  To read more about this excellent little script and how we use it extensively at MACE, check out my earlier blog post on DIY Python Microservices. It provides more information about copyit.py and its functions. I would be bereft without it!
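To make the sequence of steps described for copyit.py more concrete (check space, manifest the source, copy, manifest the destination, compare), here is a simplified Python sketch of the same idea using only the standard library. It is emphatically not copyit.py’s implementation: the function names are hypothetical, it copies with shutil rather than gcp or rsync, and it writes no manifest files or logs.

```python
import hashlib
import shutil
from pathlib import Path


def manifest(directory):
    """Map relative file paths to MD5 digests for everything in directory."""
    root = Path(directory)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            result[path.relative_to(root).as_posix()] = digest.hexdigest()
    return result


def directory_size(directory):
    """Total size in bytes of all files under directory."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())


def verified_copy(source, destination_root):
    """Copy source into destination_root and confirm the checksums match."""
    source = Path(source)
    target = Path(destination_root) / source.name
    if directory_size(source) > shutil.disk_usage(destination_root).free:
        raise OSError("not enough free space at %s" % destination_root)
    before = manifest(source)          # checksums at the source
    shutil.copytree(source, target)    # fails if target already exists
    after = manifest(target)           # checksums re-read from the copy
    return before == after
```

verified_copy returns True only when every file in the copy has the same checksum as its source, which is the guarantee that matters before deleting anything from the working drive.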

Automating copyit.py
In the last few weeks I’ve been trying to find ways to use bash scripts to automate the copyit.py processes here at MACE. I want to leave the command above to work through a collection of files or directories overnight, or while I’m working on something else. I could do this by moving the whole directory across but this won’t handle each archival package individually or generate an item specific MD5 manifest. Also it would leave you with a directory full of items on LTO, which hasn’t been our pattern for writing to LTO so far, and I don’t want to move things once on LTO. Even though it’s probably just an index update on the tape I worry that it might damage the files in some way or this additional relocation of files on tape might result in some data loss.

I’ll share the loops I’ve discovered here should you want to try them yourself, but I recommend testing with dummy files before you commit a command to write files to your LTO archive. Thanks to Kieran O’Leary for sharing his bash for looping scripts from an IFI Irish Film Archive Workshop.

To batch automate MKVs or MOVs within a directory I use the find command, which searches the directory recursively, with an -exec call that runs the Python command on each file it finds. I would still cd into the IFIscripts directory containing copyit.py for this. For MacOS Terminal this would be:

find /directory_to_search/ -name "*.mkv" -exec python3 copyit.py -l {} /path_to_LTO_volume/ \;

I haven’t tried the Windows version but I found a Stack Overflow post here that uses the code below, but please test it before use (and let me know how it works out!):

for /R "D:\directory" %I in (*.mkv) do python copyit.py "%I" /path_to_LTO_volume/

To batch automate directories within directories you can use for loops in MacOS Terminal. For this one you need to navigate into the directory you want to loop in, so:

cd /directory_of_directories/

Then you need to remember to specify the path to the IFIscripts directory containing copyit.py:

for dir in */; do python3 /path_to_script/copyit.py -l "$dir" /path_to_LTO/; done

Literally yesterday I managed to achieve a Windows version of this, and my gosh am I glad to have it sorted! This script enables me to leave DPX directories (2TB at a time) copying to LTO with the marvellous copyit.py.  I was troubled for a while by not understanding how to cd into an external volume in Windows but found a post online that explains you have to specify a volume change with Command Prompt before you can cd into a specific directory within that volume. So you just type the volume letter and colon:

T:

Then start the cd command:

cd T:/directory_of_directories/

Finally you can use the for loop below to start the command prompt loop with copyit.py (thanks to Claudio Santancini for providing this Windows alternative):

for /d %d in (*) do python /path_to_script/copyit.py "%d" /LTO_volume/
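The per-directory loops above (bash on macOS, for /d on Windows) can also be sketched once in Python, so that the same script runs on either system. This is a hedged sketch rather than anything from IFIscripts: package_dirs and copy_each are hypothetical names, hidden directories are skipped to mirror what the bash */ pattern does, and the ‘-l’ flag is kept from the macOS command, so drop it if you are on Windows.

```python
import subprocess
import sys
from pathlib import Path


def package_dirs(parent):
    """Return the immediate, non-hidden subdirectories of parent, sorted by name."""
    return sorted(
        p for p in Path(parent).iterdir()
        if p.is_dir() and not p.name.startswith(".")
    )


def copy_each(parent, script, lto_dir, dry_run=True):
    """Run copyit.py once per subdirectory so each package is handled on its own.

    Returns a list of (directory name, status) pairs, where status is the
    process return code, or "dry run" when nothing was executed.
    """
    results = []
    for directory in package_dirs(parent):
        command = [sys.executable, str(script), "-l", str(directory), str(lto_dir)]
        if dry_run:
            results.append((directory.name, "dry run"))
            continue
        completed = subprocess.run(command)
        results.append((directory.name, completed.returncode))
    return results
```

Because each directory is a separate call, a failure on one package shows up as a non-zero return code in the results without stopping the rest of the overnight batch.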

With the Windows commands I’ve noticed that if you have any spaces in your path then the dragged and dropped path will be wrapped in speech marks (" "), but when the path has no spaces, or uses _ in place of a space, it will drop in without them. I try to stick to the latter because I fear the speech marks will affect my python call. I don’t know this for certain though, so let me know what you find!  Also there’s a neat trick in Windows 10 where you can type cmd into a directory’s path bar in Windows Explorer and it will launch a command prompt with that directory as the working directory. From here you might be able to launch the for loops. I’ve not tried it so let me know if it works (read more here).
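One way to sidestep the worry about spaces altogether is to launch the copy from Python with the arguments given as a list, so that no shell quoting is involved. The sketch below does not call copyit.py at all: it starts a tiny child Python process that simply reports back the arguments it received, which lets you see whether a path with spaces arrives in one piece. The function name and the example paths are made up for illustration.

```python
import json
import subprocess
import sys


def received_arguments(arguments):
    """Start a child Python process and return the arguments it says it received."""
    # The child prints its own argument list (minus the "-c" entry) as JSON.
    child_code = "import sys, json; print(json.dumps(sys.argv[1:]))"
    completed = subprocess.run(
        [sys.executable, "-c", child_code, *arguments],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    # Hypothetical paths: the first one contains spaces.
    print(received_arguments(["T:/scans/N 0001 reel one", "/LTO_volume/"]))
```

If the list that comes back has the same number of items that went in, the path with spaces was passed through as a single argument.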

A few final snags

I thought I’d highlight a few snags that interrupt the fluidity of this workflow here at MACE, as they may crop up for you too.

IMG_9997

First, our GE4 scanner likes to dump one or two files in each DPX scan directory (see image above). We raised an issue with the manufacturers, Digital Vision, and they assisted in removing the imglist.tmp, but we still get perftest.raw in every directory we scan.  RAWcooked currently rejects this file due to its size, and if I leave it in the directory when I run an encode then the encode will fail.  The software will begin the “Analyzing files (87%), 80 files/s” assessment, count until completion, then return an error stating: “../perftest.raw is not small, expected to be an attachment?”.  Having recently had a similar failure, I currently have an open issue (#264) about erroneous items in DPX scan directories, so this will probably be resolved in the coming months.  For now I have to remember to delete this file prior to each RAWcooking session, and any historical imglist.tmp files too.
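Since remembering to delete these files is the weak point, a small pre-flight check could do the remembering instead. The following is a minimal sketch, assuming the only unwanted files are the two named above (perftest.raw and imglist.tmp); the function names are hypothetical, and by default it only reports what it would delete.

```python
from pathlib import Path

# The two scanner leftovers named in this post; extend the set if yours differ.
SCANNER_LEFTOVERS = {"perftest.raw", "imglist.tmp"}


def find_leftovers(scan_root):
    """Return every scanner leftover found in scan_root or any directory below it."""
    return sorted(
        p for p in Path(scan_root).rglob("*")
        if p.is_file() and p.name in SCANNER_LEFTOVERS
    )


def remove_leftovers(scan_root, dry_run=True):
    """List the leftovers, deleting them only when dry_run is False."""
    handled = []
    for path in find_leftovers(scan_root):
        if not dry_run:
            path.unlink()
        handled.append(path)
    return handled
```

Running remove_leftovers on the scan directory before each RAWcooking session, first as a dry run and then with dry_run=False, leaves the DPX frames and any sidecar files untouched.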

IMG_9746

Secondly, on more than one occasion I’ve made the mistake of trying to launch a RAWcooking session by dragging and dropping the first DPX inside the Scan directory, instead of the directory itself (this is because I’ll RAWcook immediately after making an FFmpeg framemd5 text file, so I’m still inside the Scan01 directory). This results in the process failing with the statement ‘Input is a file so directory will not be handled as a whole. Confirm that this is what you want to do by adding " --file" to the command.’  As you can see, this in itself isn’t a big problem, but as a result you get a dpx.rawcooked.reversibility data file (shown in image above) in your DPX stream. This won’t cause subsequent RAWcooks to fail, but it will affect the checksum validation processes after you’ve cooked and demuxed the files.  I first discovered this when testing seq2ffv1.py, which runs a short 24 frame cook/demux and validation check before running the whole DPX directory, one of its many excellent features! A new issue has been raised (#269), so again hopefully this will be amended in the coming months.
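Both halves of this snag can be caught with a short check before RAWcooked is launched. The sketch below is an assumption-laden illustration, not part of RAWcooked or seq2ffv1.py: resolve_scan_dir swaps a dropped DPX frame for its containing directory, and non_dpx_files lists anything in that directory that is not a .dpx frame, which would surface a stray reversibility file (and any other sidecars) before checksum validation. Both function names are hypothetical.

```python
from pathlib import Path


def resolve_scan_dir(dropped_path):
    """Return the directory to cook, even if a single DPX frame was dropped."""
    path = Path(dropped_path)
    if path.is_dir():
        return path
    if path.is_file():
        # A single frame was given, so fall back to the directory holding it.
        return path.parent
    raise FileNotFoundError(f"No such file or directory: {dropped_path}")


def non_dpx_files(scan_dir):
    """Return the names of files in scan_dir that are not .dpx frames."""
    return sorted(
        p.name for p in Path(scan_dir).iterdir()
        if p.is_file() and p.suffix.lower() != ".dpx"
    )
```

An empty list from non_dpx_files suggests the directory holds only frames; anything else is worth a look before cooking or validating.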

Thank you
I hope this guide will give you some basic details to help you start your DPX RAWcooking. If you have any questions drop me a message and I’d be happy to chat. Many thanks to Jérôme Martinez, Kieran O’Leary, Dave Rice, Reto Kromer, Paul Mahol, Claudio Santancini, Stephen McConnachie, Ashley Blewer, Andrew Sergeant and many others (I’m so sorry if I forgot your input!) for their guidance with these open source tools.

More reading / viewing

RAWcooked’s Doc directory has some great information about the software; you can read it on their GitHub here. For more details, particularly about the supported flavours, take a look at https://mediaarea.net/RAWcooked

Introduction to FFV1 and Matroska for film scans by Kieran O’Leary was my first entry point into learning about RAWcooked and explains it so clearly.

Our manual workflow was largely influenced by Reto Kromer’s Proof of Concept post:
Proof of Concept

Take a look at Ashley Blewer’s training pages, where there are slides about FFV1, Matroska, FFmpeg, Command Line and soooo many more: https://training.ashleyblewer.com/

I spent hours returning to the FFV1 CheatSheet by Peter Bubestinger-Steindl. Also, Peter’s recently created a new checksum called Streamhash, which I’ve yet to test, but it’s now on FFmprovisr so take a look!

Dave Rice has a great blog post called Reconsidering Checksums for Audiovisual Preservation that I recommend looking at.

I’ve spent hours watching videos from the MediaArea No Time to Wait Conference on their YouTube channel. There are RAWcooked videos on there, stress testing FFV1, and so many more that you should definitely watch.  And for info about the No Time to Wait 4 Conference check out MediaArea’s website.  Not long to go now!

And don’t miss:
AMIA Open Source
FFmprovisr
FFmpeg
