The Case of the Failing Framemd5…

The BFI National Archive SD video tape workflow transcodes and stores captured video tape as FFV1 matroska files, ingesting them into the Digital Preservation Infrastructure for long-term preservation.

FFV1 is a lossless video codec that is now a standard for many SD video tape archives around the world. Sadly, producers of non-linear editing software have been a little slow to implement support for FFV1. This means there is demand for transcoding FFV1 back to lossless V210 mov for access and editing.

Image from Framemd5 Intro and How To page

If you’re new to framemd5 files

Framemd5 files are generally used to validate lossless transcodes such as FFV1 to V210, or the muxing of DPX image sequences into FFV1 streams. They are like the standard MD5 checksum, a hexadecimal code generated by running a hashing algorithm against a whole file. A framemd5 is a more granular version of the MD5, generating a checksum for every frame of an audiovisual file. The framemd5 lists all of the frames and their individual hex codes (see example for stream 0 video only framemd5).

As you’ve probably deduced, this means if there’s a problem in your video file you can exactly pinpoint it, even to a frame’s video or audio stream.

It’s really easy to generate a framemd5 of a mov and mkv video file using FFmpeg:
ffmpeg -i input.mkv -f framemd5 output.mkv.framemd5
ffmpeg -i input.mov -f framemd5 output.mov.framemd5

The audiovisual asset is decoded to rawvideo by the framemd5 command before the checksums are generated. This is how you can compare different lossless codecs against one another.
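If you would rather launch these commands from Python, the same calls can be wrapped in subprocess. This is only a minimal sketch: the function names build_framemd5_cmd and make_framemd5 are my own for illustration, and it assumes ffmpeg is available on your PATH.

```python
import subprocess


def build_framemd5_cmd(input_path, output_path):
    """Return the FFmpeg argument list that writes a framemd5 for input_path."""
    return [
        "ffmpeg", "-nostdin",   # -nostdin stops FFmpeg waiting on keyboard input
        "-i", input_path,
        "-f", "framemd5",
        output_path,
    ]


def make_framemd5(input_path, output_path):
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_framemd5_cmd(input_path, output_path), check=True)


# Show the command that would be run for a hypothetical file name.
print(" ".join(build_framemd5_cmd("input.mkv", "output.mkv.framemd5")))
```

Calling make_framemd5("input.mkv", "output.mkv.framemd5") and then make_framemd5("input.mov", "output.mov.framemd5") would produce the same pair of files as the two commands above.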

Then to check if they are identical you can use diff for macOS or Linux:
diff -s output.mkv.framemd5 output.mov.framemd5

Or for Windows you can use fc (file compare):
fc output.mkv.framemd5 output.mov.framemd5

If there’s a difference, either command will output the lines where the variation occurs. If they’re identical you’ll receive a nice little message telling you so.
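The same check can be done inside a Python script. The sketch below is an assumption about how you might do it rather than how our scripts do: it skips the header lines that start with #, reads each frame row as stream, dts, pts, duration, size, hash, and reports the first few rows that differ. The function names and the short placeholder hashes are invented for the example.

```python
def parse_framemd5(text):
    """Return a list of (stream, pts, hash) tuples, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(",")]
        # Frame rows are: stream, dts, pts, duration, size, hash
        rows.append((int(fields[0]), int(fields[2]), fields[-1]))
    return rows


def first_mismatches(text_a, text_b, limit=5):
    """Return up to `limit` (row_index, row_a, row_b) entries where rows differ."""
    rows_a, rows_b = parse_framemd5(text_a), parse_framemd5(text_b)
    mismatches = []
    for index in range(max(len(rows_a), len(rows_b))):
        row_a = rows_a[index] if index < len(rows_a) else None
        row_b = rows_b[index] if index < len(rows_b) else None
        if row_a != row_b:
            mismatches.append((index, row_a, row_b))
            if len(mismatches) == limit:
                break
    return mismatches


# Placeholder framemd5 text with made-up hashes, not real FFmpeg output.
mkv_text = "#hash: MD5\n0, 0, 0, 1, 829440, aaaa\n0, 1, 1, 1, 829440, bbbb\n"
mov_text = "#hash: MD5\n0, 0, 0, 1, 829440, aaaa\n0, 1, 1, 1, 829440, cccc\n"
print(first_mismatches(mkv_text, mov_text))
# [(1, (0, 1, 'bbbb'), (0, 1, 'cccc'))]
```

An empty list means every frame row matched. Unlike diff, this ignores the header lines, which is a design choice you may not want if the headers matter to you.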

Help from an #AVpres data detective

So for a little context, recently we’ve been receiving externally captured V210 mov files, which are transcoded at the BFI for preservation storage to FFV1 matroska. Now some of these files are being batch transcoded back to V210 mov using the script batch_transcode_ffv1_v210.py. During early tests I was seeing multiple transcodes fail the framemd5 diff comparison, with large chunks of the file’s video stream returning mismatches. These failing files all tended to be from one or two specific suppliers, where other suppliers’ captures would all pass verification fine.

Reviewing the two video files visually in MPV player or FFplay resulted in identical looking images, with no traceable problems, frame by frame. I must have reviewed both my own FFmpeg command and our V210 to FFV1 transcode scripts a dozen times but couldn’t see any causes there.

Thankfully a short conversation with Dave Rice revealed a potential solution which he was kind enough to explain clearly – and even write up a little code for. He suggested it could be a problem introduced by the capture card technology when the video was first created as a V210 mov. I’ll try to explain it as clearly as I can here in the hope it helps others who may run into this problem in the future.

Apple’s Technical Note TN2162 outlines mapping and encoding schemes for QuickTime files, developed from digital video industry specifications such as Rec. ITU-R BT. 601-4. For this example we look to scheme B, where the documentation states: “Y´ is an unsigned integer. Cb and Cr are offset binary integers. Certain Y´, Cb, and Cr component values are reserved as synchronization signals and must not appear in a buffer.”

For 10-bit video, like the files tested by my script, these reserved values are 0, 1, 2, 3 and 1020, 1021, 1022, 1023. The document continues: “The writer of a QuickTime image is responsible for omitting these values. The reader of a QuickTime image may assume that they are not present”.

According to Dave Rice: “The remaining component values (e.g., 4-63 and 961-1019 for n=10 bits) accommodate occasional filter undershoot and overshoot in image processing. In some applications, these values are used to carry other information (e.g., transparency).”
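To make the numbers concrete, here is a small sketch that works out the reserved code values for a given bit depth. The 10-bit result matches the values quoted above. Generalising to other bit depths by scaling from 8-bit video (where 0 and 255 are the reserved values) is my own assumption, and the function names are invented for illustration.

```python
def reserved_values(bits=10):
    """Return (low, high) ranges of reserved code values for a component."""
    scale = 2 ** (bits - 8)                 # 10-bit video: scale is 4
    low = range(0, scale)                   # 10-bit video: 0..3
    high = range(255 * scale, 2 ** bits)    # 10-bit video: 1020..1023
    return low, high


def is_reserved(value, bits=10):
    """True if the code value falls in either reserved range."""
    low, high = reserved_values(bits)
    return value in low or value in high


low, high = reserved_values(10)
print(list(low), list(high))
# [0, 1, 2, 3] [1020, 1021, 1022, 1023]
```

Everything from 4 to 1019 is therefore legal in a 10-bit file, including the undershoot and overshoot ranges mentioned in the quote.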

So how is this relevant?

FFmpeg follows these guidelines, omitting signals within the reserved value range. Certain capture card technologies (Black Magic cards being one) do not follow these guidelines, and so these reserved values can be populated with information. When FFmpeg transcodes an FFV1 to V210 it correctly assumes these values are unused and does not preserve them, so that part of the V210 becomes lossy, while in the FFV1 they remain lossless.

This becomes a problem when you run a framemd5 comparison and the rawvideo conversions have lossy converted sections in the V210 framemd5 and lossless converted rawvideo sections in the FFV1 framemd5. If, as with my scripts, a positive framemd5 comparison signals a successful transcode then you face a lot of wasted repeat transcoding attempts.

Thankfully when you write a framemd5 command you can use a video filter called lutyuv to crop the y, u and v values, so that the framemd5 runs against only the known safe range of both the FFV1 and the V210. You must run the crop on both to get the image match. Dave’s FFmpeg command solution in full:

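    # fullpath and output_mkv are set elsewhere in the script.
    framemd5_mkv = [
        "ffmpeg", "-nostdin",
        "-i", fullpath,
        # Raw string: the backslash-escaped commas must reach FFmpeg unchanged.
        "-vf", r"lutyuv=y=if(gt(val\,1019)\,1019\,if(lt(val\,4)\,4\,val)):u=if(gt(val\,1019)\,1019\,if(lt(val\,4)\,4\,val)):v=if(gt(val\,1019)\,1019\,if(lt(val\,4)\,4\,val))",
        "-f", "framemd5",
        output_mkv
    ]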

I can confirm this method works really well, with framemd5 comparisons still failing where there is a difference in the remaining LUT range. If you want to see this fix in operation in my Python script then take a look at the transcoding repository below for more information.
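Because the same filter has to be applied to both the FFV1 and the V210, it can help to build the filter string once and reuse it. The sketch below is one way this might look, not how the repository does it: the helper names are hypothetical, and the clamp function is only a plain Python mirror of the lutyuv expression to show what happens to each sample value.

```python
def clamp_expr(low=4, high=1019):
    """Build the lutyuv expression that limits a component to low..high."""
    # Commas are backslash-escaped so FFmpeg does not read them as separators.
    return rf"if(gt(val\,{high})\,{high}\,if(lt(val\,{low})\,{low}\,val))"


def safe_range_filter(low=4, high=1019):
    """Apply the same expression to the y, u and v components."""
    expr = clamp_expr(low, high)
    return f"lutyuv=y={expr}:u={expr}:v={expr}"


def framemd5_cmd(input_path, output_path):
    """FFmpeg argument list for a framemd5 limited to the safe value range."""
    return [
        "ffmpeg", "-nostdin",
        "-i", input_path,
        "-vf", safe_range_filter(),
        "-f", "framemd5",
        output_path,
    ]


def clamp(value, low=4, high=1019):
    """Plain Python mirror of the expression, for illustration only."""
    return high if value > high else low if value < low else value


# Reserved values are pulled to the nearest legal value; others are untouched.
print([clamp(v) for v in (0, 3, 4, 512, 1019, 1023)])
# [4, 4, 4, 512, 1019, 1019]
```

Building one command with framemd5_cmd for the mkv and another for the mov keeps the two filters identical, which is what makes the resulting framemd5 files comparable.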

Framemd5 hashes don’t age well…

While discussing this problem with Dave he further explained that the framemd5 itself is not to be seen as a long-term preservation method, but should rather be used for immediate fixity validation. This is due to its reliance on the pix_fmt function. As FFmpeg versions evolve the pix_fmt function changes, allowing it to work with new or improved video codecs. This means the way the framemd5 converts the codec into rawvideo changes too, resulting in different framemd5 hashes between different FFmpeg versions.

If you’ve been creating framemd5 files for long-term preservation storage and you upgrade your FFmpeg then I’m afraid it’s unlikely you’ll have matching framemd5 checksums – which could be wrongly mistaken for corruption of your file! However, if you keep the version of FFmpeg that was used for the framemd5 generation installed on a server somewhere, then you can use this to check the long-term fixity health of your files.
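One small safeguard is to check which library version wrote a framemd5 before trusting a comparison. Framemd5 files normally carry a "#software:" header line naming the libavformat (Lavf) version that wrote them, although it can be absent if the file was written in bitexact mode. The sketch below reads that line. The function names and the version numbers in the example are placeholders of my own, and the Lavf number identifies the library rather than the FFmpeg release itself.

```python
def framemd5_software(text):
    """Return the value of the '#software:' header line, or None if absent."""
    for line in text.splitlines():
        if line.startswith("#software:"):
            return line.split(":", 1)[1].strip()
        if line and not line.startswith("#"):
            break  # headers are over once the frame rows begin
    return None


def same_generator(text_a, text_b):
    """True only if both files name the same, non-missing software version."""
    software_a = framemd5_software(text_a)
    software_b = framemd5_software(text_b)
    return software_a is not None and software_a == software_b


# Placeholder header text, not taken from a real file.
old = "#format: frame checksums\n#software: Lavf58.45.100\n0, 0, 0, 1, 829440, aaaa\n"
new = old.replace("Lavf58.45.100", "Lavf60.3.100")
print(framemd5_software(old), same_generator(old, new))
# Lavf58.45.100 False
```

If same_generator returns False, a mismatch between two framemd5 files may come from the FFmpeg upgrade rather than from any change to the video file.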

Sincere thanks to Dave Rice for his insight into FFmpeg and framemd5 checksums, and allowing me to share this issue, so others don’t have the same mystery to solve.

Transcode scripts now on GitHub

There have been lots of scripts to write in the past several months here at the BFI National Archive and so much new stuff to learn! Amongst my favourite work so far is writing these few transcoding scripts using the amazing open source FFmpeg, MediaInfo and MediaConch software. Some of these scripts have been shared to the BFI National Archive GitHub page, where you can read the script descriptions in full in the README.md, and in the script comments:

https://github.com/bfidatadigipres/transcoding

This repository features three Python scripts, two bash shell launch scripts and a MediaConch policy. Two Python scripts are for batch conversion of high volumes of video files, with the Python scripts launched concurrently by GNU parallel from a bash launch script. The third Python script transcodes files ad hoc, so it runs just once a day transcoding individual files as it finds them in the watch folder.

They’re all launched from the server’s crontab scheduler, and those that run batch encoding concurrently have flock locks in place to stop accidental overlapping runs of the same scripts. I’m not going to write too much more about how these scripts work, because you can read all about it at the repository itself.

Special thanks must be given to Katherine Frances Nagels (@knfrances) whose Python guidance set my format for using FFmpeg with Python subprocess calls and to my BFI colleagues for support configuring the FFmpeg commands.

I’m grateful for code review from the #AVpres community, and welcome all feedback! I hope the scripts can help others get writing Python transcode scripts, and if you do test them please do so safely. They’re available under the MIT Licence.

Links

For the definitive checksum text see Dave Rice’s article Reconsidering the Checksum for Audiovisual Preservation.

FIAF’s FFV1 and Matroska reading List compiled by Stephen McConnachie, Head of Data at the BFI National Archive, is a great starting point to learn about this preservation pairing.

The Python transcoding scripts rely heavily on the following open source projects which provide amazing tools:
FFmpeg.org
MediaArea.net MediaInfo
MediaArea.net MediaConch

To find out more about framemd5 take a look at this how to from FFmpeg.

If you want to visually compare framemd5 logs then I recommend Meld GUI. It’s a diff and merge tool for developers and really visually useful.

To get started with FFmpeg and writing scripts for #AVpres then I recommend you look at FFmprovisr, which has been recently overhauled and is looking stunning!

Check out Ashley Blewer’s comprehensive training website. You can enter it from her blog introduction here, which beautifully summarises the financial value of sharing #AVpres knowledge!

There are some really great bash script tutorials available at Reto.ch amongst many other useful things.

Check out Morgan O’Morel’s helpful break down of the BAVC FFV1 to MOV transcoding scripts here.

For any strange tape anomalies, take a look at the AV Artifacts Atlas, a detailed collection (with imagery) of the wonderful weirdness of video.